For some time I have been testing connectivity between an existing MS Access database and MATLAB. I currently have the following local configuration (both MATLAB and the database on the same local drive):
MATLAB 2013a (32-bit) and MS Access 2007.
After resolving connection problems with 64-bit MATLAB, I moved to 32-bit and the connection now works fine. The connection is made via the Database Toolbox:
conn = database('Test_DB', '', '');
What is very annoying is the execution time. I have compared execution times inside MS Access (running the query directly in the database with the Run! button) with the time MATLAB needs to execute the same query and retrieve the data with fetch. The difference is almost an order of magnitude.
Typically, I have two big tables (Table1: 20 columns × 1,000,000 rows; Table2: 10 columns × 10,000,000 rows). The query is quite simple, combining several fields from both tables based on a selected date. Inside Access (whether version 2003 or 2007) it takes roughly 7 to 10 seconds. When executed from MATLAB (the SQL command is exactly the same) it takes between 70 and 75 seconds. I have tried many things to understand what the issue is here, but with no success. If somebody knows about similar issues I would be glad to hear some opinions.
To be more specific: MATLAB 2013a 32-bit on 64-bit Windows 7, i7-3770 with 8 GB RAM. For the Database Toolbox I use the ODBC Microsoft Access Driver 6.01.7601.17632 (ODBCJT32.DLL, dated 23.12.2011).
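One way to narrow the 70-second figure down is to time the query execution and the data transfer separately. This is a minimal diagnostic sketch using the Database Toolbox calls already mentioned above (exec/fetch); strSQL is the query string defined below:

```matlab
% Time exec (query execution on the Access side) and fetch
% (row transfer through ODBC into MATLAB) separately, to see
% where the 70+ seconds are actually being spent.
conn = database('Test_DB', '', '');

tic; curs = exec(conn, strSQL); tExec  = toc;  % query execution
tic; curs = fetch(curs);        tFetch = toc;  % result-set transfer
fprintf('exec: %.1f s, fetch: %.1f s\n', tExec, tFetch);

close(curs);
close(conn);
```

If nearly all the time is in fetch, the bottleneck is the ODBC row transfer and MATLAB-side data conversion rather than the query plan itself.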
The query uses two tables, T1 and T2, and looks as follows:
strSQL = ['SELECT DISTINCT T1.TF1, T1.SI1, T1.SI2, T2.TF2, T1.DATE1 ' ...
          'FROM T2 INNER JOIN T1 ' ...
          'ON T2.TF1 = T1.TF1 ' ...
          'WHERE (((T1.DATE1)=#', date1, '#));'];
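For reference, the complete round trip in MATLAB looks roughly like the sketch below. One thing worth experimenting with is the batch-fetching preferences of the 2013-era Database Toolbox (setdbprefs with 'FetchInBatches' and 'FetchBatchSize'); whether they help depends on the driver, so treat this as a sketch, not a fix:

```matlab
% Connect through the 32-bit ODBC data source 'Test_DB'
conn = database('Test_DB', '', '');

% Fetch rows in larger batches instead of the default behavior;
% these are Database Toolbox preferences of the R2013a era
setdbprefs('FetchInBatches', 'yes');
setdbprefs('FetchBatchSize', '100000');

curs = exec(conn, strSQL);  % run the query
curs = fetch(curs);         % retrieve the result set
data = curs.Data;           % N-by-5 cell array (mixed text/numeric/date)

close(curs);
close(conn);
```

Because the result set mixes text and numeric columns, it comes back as a cell array, which is itself a known performance and memory cost (see the comment below).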
TF1, TF2 are text fields.
SI1, SI2 are numeric (simple) fields.
DATE1 is a date field.
T1 has 7,000,000 rows: 2 text fields, 3 numeric fields, 1 date field.
T2 has 13,000 rows: 39 text fields, 12 numeric fields, 1 date field.
Using cellstr strings in MATLAB fragments the memory, and you can get bogged down well before your aggregate memory usage hits your physical memory size. To make large datasets fast, avoiding cellstr is high on the priority list. Do you need to combine the results from all 1200 queries in memory? Or can they be processed individually? Are the string columns high or low cardinality? Maybe they could be replaced by more efficient numeric or categorical identifiers. Can you include a few rows of sample data in the question? – Andrew Janke