понеділок, 1 лютого 2016 р.

Accessing MS Access from python with Cyrillic table names

My today's issue is quite interesting and is related to the data migration from MS Access to some regular (and more widely-used) database, i.e. MySQL. So we have as input a database, created in the MS Access 2000 (supposed) with the Cyrillic tables, that contains near 5k of sensible records. And expected result should be any DB that is supported by SQL Alchemy (right now it is MySQL, but I believe it will be easy to migrate to any other). And here is the most complex part: "How"?

Firstly I needed some way to provide ODBC  support for my Linux (Windows users/developers are more lucky at this point) as well as driver for access to MDB files. These packages provide such support for Debian:
  • unixodbc
  • odbcinst
  • libodbc1
  • libiodbc2
  • odbc-mdbtools
I got this list of packages from amirkdv, who implemented ODBC module for PHP. [Note: later I found a plenty of other docs about this subject, so I'm not good at Googl'ing at all]. Among other drivers there was also a web-page for unixodbc.org and Easysoft's page, who asked near 500 pounds for their library. Maybe it worth that money, but I need such access for one-time operation, so that's not my case right now.

So final command on my Debian system looked like:
$sudo apt-get install unixodbc odbcinst odbc-mdbtools

Files /etc/odbc.ini and /etc/odbcinst.ini were created after the installation and the latter one contained following lines:
[MDBTools]
Description=MDBTools Driver
Driver=libmdbodbc.so
Setup=libmdbodbc.so
FileUsage=1
UsageCount=1

It was enough for my case. The line "[MDBTools]" in the mentioned above file states for the name of the used driver, that should be indicated for granting access to the database.

At this point system's preparation is over, and it is a time, to look around about libraries that should provide access to the DB from the python. I discovered few of them:
  • pyodbc - outdated and I discovered this a bit later;
  • pypyodbc;
  • sqlalchemy-access - outdated as well.
Unfortunately, pyodbc failed to connect to the database, so I've been forced to use pypyodbc as the only possible way at that moment. Being inspired by usage examples for pypyodbc I finished with something like this: 

import pypyodbc
from os.path import abspath

drv ='MDBTools'
db_file = abspath('./pl.mdb') # Here "pl.mdb" is my input file
con = pypyodbc.connect("DRIVER={};DBQ={}".format(drv, db_file))
cur = con.cursor()

tables = list(cur.tables())

# Get list of tables from the file
for tbl in tables:
  print tbl


And got something like that:
('', '', '\xd0\x92\xd1\x83\xd0\xbb\xd0\xb8\xd1\x86\xd1\x96\xd0\x9b\xd1\x8c\xd0\xb2\xd0\xbe\xd0\xb2\xd0\xb0', 'TABLE', '')
('', '', '\xd0\x9d\xd0\xb0\xd1\x81\xd0\x9f\xd1\x83\xd0\xbd\xd0\xba\xd1\x82\xd0\xb8\xd0\x9b\xd0\x9e', 'TABLE', '')
('', '', '\xd0\xa2\xd0\xb8\xd0\xbf\xd1\x8b', 'TABLE', '')
('', '', '\xd0\xa2\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x80\xd1\x8b', 'TABLE', '')
('', '', '\xd0\xa4\xd1\x96\xd1\x80\xd0\xbc\xd0\xb8\xd0\x9f\xd0\xbe\xd1\x81\xd1\x82\xd0\xb0\xd1\x87\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd0\xb8\xd0\xba\xd0\xb8', 'TABLE', '')

Here I should shout "Hooray! I did it!", but unfortunately, there is no reason for joy. As you may notice, tables are named by Cyrillic names, so now I should deal with them in some way.
The only solution that came into my mind was to call SQL statement as Unicode string, with the table's name encoded as regular string, i.e.:

sql_statement = u'SELECT * FROM "\xd0\x92\xd1\x83\xd0\xbb\xd0\xb8\xd1\x86\xd1\x96\xd0\x9b\xd1\x8c\xd0\xb2\xd0\xbe\xd0\xb2\xd0\xb0"' 
res = cur.execute(sql_statement).fetchall()
for item in res:
  print item

So, I got a list of tuples like this one (stripped version), from which records can be converted into regular Unocode.
('\xd0\x94\xd0\xb6\xd0\xb5\xd1\x80\xd0\xb5\xd0\xbb\xd1\x8c\xd0\xbd\xd0\xb0',)
('\xd0\x94\xd0\xb7\xd0\xb8\xd0\xbd\xd0\xb4\xd1\x80\xd0\xb8 \xd0\x84.',)
('\xd0\x94\xd0\xb8\xd0\xb2\xd1\x96\xd0\xb7\xd1\x96\xd0\xb9\xd0\xbd\xd0\xb0',)
('\xd0\x94\xd0\xb8\xd0\xba\xd1\x82\xd0\xbe\xd0\xb2\xd0\xb0',)
('\xd0\x94\xd0\xb8\xd1\x82\xd1\x8f\xd1\x87\xd0\xb0',)


The converted version of the records is:

Джерельна
Дзиндри Є.
Дивізійна
Диктова
Дитяча

And now it is a time for "Hooray!!!!"