Extract large Matlab dataset subsets

Question

Referencing and assigning a subset of a MATLAB dataset appears to be extremely inefficient, and the time taken possibly scales like rows^2.

Example:

alldata is a large dataset of mixed data - say 150,000 rows by 25 columns (integer, boolean and string).

The format for the dataset is:

'format', '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s'

I then convert two of the integer columns to boolean.

The following subset assignment:

somedata = alldata(1:m,:)

takes more than 7 seconds for m = 10,000 and far longer for larger values of m. Plotting time against m shows a roughly m^2 relationship, which is strange given that copying alldata is nearly instantaneous, as is using functions like sortrows and find. In fact, reading the original .csv data file is faster than the above assignment for large values of m.
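The timing curve described above can be reproduced with a small tic/toc loop. This is only a sketch: it assumes the Statistics Toolbox dataset class is available, and it builds a synthetic stand-in for alldata (the variables id, name and flag, and the sizes in mvals, are made up for illustration and are not the real data).

```matlab
% Sketch: time subset extraction for increasing m on a synthetic dataset.
% The columns below are hypothetical stand-ins for the real 'alldata'.
n    = 20000;
id   = (1:n)';
name = arrayfun(@(k) sprintf('ref%06d', k), id, 'UniformOutput', false);
flag = mod(id, 2) == 0;
alldata = dataset(id, name, flag);   % variable names become column names

mvals = [1000 2000 5000 10000];      % arbitrary sizes chosen for illustration
t = zeros(size(mvals));
for k = 1:numel(mvals)
    m = mvals(k);
    tic;
    somedata = alldata(1:m, :);      % the subset operation being measured
    t(k) = toc;
end

% On a log-log plot, a slope near 2 would suggest m^2 scaling.
loglog(mvals, t, 'o-');
xlabel('m (rows extracted)');
ylabel('time (s)');
```

Running the same loop on the real alldata (skipping the synthetic setup) should show whether the slowdown depends on the data itself, for example on the number of unique strings.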

Using the profiler, it appears there is a function, subsref, that includes a very slow line that performs string comparisons to determine unique values within the dataset. Is this related to how the dataset type is stored (i.e. a reference table)? The dataset includes a large number of unique string values.

Are there any solutions for extracting a subset of a dataset in MATLAB, such as preallocation (how?), or copying the dataset and deleting rows rather than assigning an extract/subset?

I am using a dual-core machine with 1.5 GB of RAM, but Task Manager reports less than 1 GB of RAM in use.

Could you give us a snapshot example of your database? Just a few rows and all the columns will do. — Jacob
Ummm - the data is... sensitive. It mainly consists of an observation ID, several reference IDs kept as strings, several date fields (stored as strings, as I haven't gotten round to working with them yet), two boolean columns, several integer fields (most single integer) and a whole bunch of other string fields (typically less than 20-30 characters). I can give you the actual sequence of variable types if that would help? — Vahid
Here is an example of someone else having the same issue: mathworks.com/matlabcentral/newsreader/view_thread/… — Vahid

Amro · Accepted Answer · 2010-09-29T02:16:44

I have previously worked with MATLAB's dataset arrays for large data; unfortunately, it's true that they suffer from performance issues. One thing I found that helps speed things up is to clear the observation names (ObsNames) property.

Try the following fix:

%# I assume you have a 'dataset' object
ds = dataset(...);

%# clear the observation names property (it is simply a label for each record)
ds.Properties.ObsNames = [];
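
To check whether clearing ObsNames actually helps on a given machine, the subset can be timed before and after. The following is a minimal sketch, assuming the Statistics Toolbox dataset class is available; the synthetic columns (id, label), the sizes n and m, and the use of label as observation names are all made up for illustration.

```matlab
% Sketch: compare subset time with and without observation names.
% All data here is synthetic and hypothetical.
n     = 20000;
id    = (1:n)';
label = arrayfun(@(k) sprintf('obs%06d', k), id, 'UniformOutput', false);
ds    = dataset(id, label, 'ObsNames', label);   % one unique name per row

m = 5000;

tic;
sub1 = ds(1:m, :);              % subset while ObsNames is populated
tWith = toc;

ds.Properties.ObsNames = [];    % clear the per-row labels, as suggested above

tic;
sub2 = ds(1:m, :);              % same subset after clearing ObsNames
tWithout = toc;

fprintf('with ObsNames: %.3f s, without: %.3f s\n', tWith, tWithout);
```

Note that clearing ObsNames discards the row labels, so if they carry information it may be worth copying them into an ordinary column first.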
