
For some time I have been testing connectivity between an existing MS Access database and MATLAB. My current local configuration (both MATLAB and the DB on the same local drive) is:
MATLAB R2013a (32-bit) and MS Access 2007.
After running into connection problems with 64-bit MATLAB, I moved to the 32-bit version, and the connection now works fine. The connection is made via the Database Toolbox:

conn = database('Test_DB', '', '');

What is very annoying is the execution time. I have compared execution times within MS Access (running the query directly in the database with the Run! button) against the time MATLAB takes to execute the query and retrieve the data with fetch. The difference is almost an order of magnitude.

Typically, I have two big tables (Table1: 20 columns x 1,000,000 rows; Table2: 10 columns x 10,000,000 rows). The query is quite simple, combining several fields from both tables based on a selected date. Inside Access (whether version 2003 or 2007) it takes roughly 7 to 10 seconds. When executed from MATLAB (with exactly the same SQL command) it takes between 70 and 75 seconds. I have tried many things to understand the issue, but with no success. If somebody knows about similar issues, I would be glad to hear some opinions.

To be more specific: MATLAB R2013a 32-bit on 64-bit Windows 7, i7-3770 with 8 GB RAM. For the Database Toolbox I use the ODBC Microsoft Access Driver 6.01.7601.17632 (ODBCJT32.DLL, dated 23.12.2011).

The query uses two tables T1 and T2 and looks as follows:

strSQL = [ 'SELECT DISTINCT T1.TF1, T1.SI1, T1.SI2, T2.TF2, T1.DATE1 ' ...
           'FROM T2 INNER JOIN T1 ' ...
              'ON T2.TF1 = T1.TF1 ' ...
           'WHERE (((T1.DATE1)=#', date1, '#));'];   
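One way to narrow down where the time goes (a rough sketch, assuming the `conn` and `strSQL` variables defined above) is to time the query execution and the data transfer separately:

```matlab
% Sketch: time query execution and data retrieval separately
% (assumes conn and strSQL as defined above)
tic;
curs = exec(conn, strSQL);   % runs the query in Access
tExec = toc;
tic;
curs = fetch(curs);          % transfers the results and converts them to MATLAB types
tFetch = toc;
fprintf('exec: %.1f s, fetch: %.1f s\n', tExec, tFetch);
close(curs);
```

If most of the time is in fetch, the bottleneck is data transfer and conversion rather than the query itself.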

TF1, TF2 are text fields
SI1, SI2 are simple numeric fields
DATE1 is a date field
T1 has 7,000,000 rows: 2 text fields, 3 numeric fields, 1 date field
T2 has 13,000 rows: 39 text fields, 12 numeric fields, 1 date field

When you do the query in Access, it's running inside the db application and keeping the results there. When doing it from Matlab, it adds the cost of inter-process data transfer, and conversion of the data to Matlab datatypes. This can be expensive, particularly with strings and dates. Can you include more specific details about the driver, query, and SQL column types involved? Profile your query. And check out this similar question: stackoverflow.com/questions/23244179/… (full disclosure - that's my answer there.) – Andrew Janke
Dear Andrew, thanks a lot for your tips. – Chem1967
Dear Andrew, thanks a lot for your comments and tips. I have added some more specific information and will try to reconfigure the Database Toolbox. – Chem1967
Yeah, you've got strings and dates coming back. Looks like it's inter-process data transfer and data conversion. How many rows in the typical result sets? – Andrew Janke
Unfortunately, that slowdown's not surprising: creating lots of cellstr strings in Matlab fragments the memory, and you can get bogged down well before your aggregate memory usage hits your physical memory size. To make large datasets fast, avoiding cellstr is high on the priority list. Do you need to combine the results from all 1200 queries in memory? Or can they be processed individually? Are the string columns high or low cardinality? Maybe they could be replaced by more efficient numeric or @categorical identifiers. Can you include a few rows of sample data in the question? – Andrew Janke

1 Answer


The additional time spent running it from Matlab is probably in transferring the data out of the Access engine and converting it to Matlab datatypes. There's quite an impedance mismatch there, and Matlab doesn't necessarily use the most efficient types for it.

This is slow enough that it sounds like you might be using the default cell array data return format. That format is inefficient and unsuitable for larger data sets. (Or, imho, for much of anything.) It stores every column, including numerics, in a 2-D cell array.

Switch to the structure or the new table data return format using setdbprefs(). That should give you some speedup and help with the memory fragmentation.

setdbprefs('DataReturnFormat', 'table');
conn = database( ... )

(I'm not sure whether table is available yet in R2013a; it's new. Try it and see if it works; it's not well documented even in R2014a, where it's definitely available.)

At that point, string and date columns will be your major data transfer cost. If you can alter your query or the schema to return numeric identifiers instead, that could speed things up a lot. And if you have low cardinality string columns, you can convert them to @categorical variables to save space once they're inside Matlab.
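As a sketch of the categorical conversion (assuming a table `t` returned by fetch with a text column `TF1`; categorical arrays need R2013b or later):

```matlab
% Sketch: shrink a low-cardinality text column after fetching
% (t is a table returned by fetch; TF1 is one of its text columns)
t.TF1 = categorical(t.TF1);   % each distinct string is stored only once
summary(t.TF1)                % shows the categories and their counts
```

This doesn't speed up the transfer itself, but it cuts memory use and makes later grouping and comparison operations on that column much cheaper.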

Retrieving dates as strings is expensive, and you want them to end up as datenums. You may be able to speed this up further by pushing the conversion from SQL DATE to Matlab datenum into the Access layer, using a conversion expression written in SQL. And in fact, in this query, since you're already fixing T1.DATE1 to a known value in the WHERE clause, don't retrieve it as a column at all. Just set the column to the known value in the Matlab layer. That saves you the expense of transferring and converting the date values. Something like this instead:

setdbprefs('DataReturnFormat', 'table');
conn = database('Test_DB', '', '');

myDates = [];  % fill in: a list of dates as datenums, not strings
for thisDate = myDates
   sql = [ 'SELECT DISTINCT T1.TF1, T1.SI1, T1.SI2, T2.TF2 ' ...
           'FROM T2 INNER JOIN T1 ' ...
             'ON T2.TF1 = T1.TF1 ' ...
           'WHERE (((T1.DATE1)=#', datestr(thisDate, 'yyyy-mm-dd'), '#));'];
   curs = fetch(exec(conn, sql));
   t = curs.Data;
   % DATE1 is known from the WHERE clause; add it back without transferring it
   t.DATE1 = repmat(thisDate, [height(t) 1]);
   % ... do stuff with t ...
end

And try using the native ODBC connection method. That saves you the added expense of the JDBC-ODBC bridge driver, which is what the plain database('DSN', '', '') connection uses.
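A sketch of the native ODBC form (I believe the interface is database.ODBCConnection, added around R2013a; check the documentation for your release):

```matlab
% Native ODBC connection, bypassing the JDBC-ODBC bridge
conn = database.ODBCConnection('Test_DB', '', '');
curs = fetch(exec(conn, strSQL));  % strSQL as defined in the question
t = curs.Data;
close(curs);
close(conn);
```

The cursor and fetch calls work the same way as with the bridge connection, so this is usually a one-line change.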