1
votes

Referencing and assigning a subset of a MATLAB dataset appears to be extremely inefficient, and possibly scales like rows^2.

Example:

alldata is a large dataset of mixed data - say 150,000 rows by 25 columns (integer, boolean and string).

The format for the dataset is:

'format', '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s'

I then convert two of the integer columns to boolean.

The following subset assignment:

somedata = alldata(1:m,:)

takes more than 7 seconds for m = 10,000 and ridiculous amounts of time for larger values of m. Plotting time against m shows an m^2-type relationship, which is strange given that copying alldata is nearly instantaneous, as is using functions like sortrows and find. In fact, reading the original .csv data file is faster than the above assignment for large values of m.

According to the profiler, the dataset's subsref function contains a very slow line that performs string comparisons to determine unique values within the dataset. Is this related to how the dataset type is stored (i.e. a reference table)? The dataset includes a large number of unique string values.

Are there any solutions for extracting a subset of a dataset in MATLAB? For example, preallocation (how?), or copying the dataset and deleting rows rather than assigning an extract/subset.

I am using a dual-core machine with 1.5 GB of RAM, but Task Manager reports less than 1 GB of RAM in use.

2
Could you give us a snapshot example of your database? Just a few rows and all the columns will do. - Jacob
Ummm - the data is... sensitive. It mainly consists of an observation ID, several reference IDs kept as strings, several date fields (stored as strings, as I haven't gotten round to working with them yet), two boolean columns, several integer fields (most single integer) and a whole bunch of other string fields (typically less than 20-30 characters). I can give you the actual sequence of variable types if that would help? - Vahid
Here is an example of someone else having the same issue: mathworks.com/matlabcentral/newsreader/view_thread/… - Vahid

2 Answers

2
votes

I have previously worked with MATLAB's dataset array for large data, and unfortunately it's true that they suffer from performance issues. One thing I found that helps speed things up is to clear the observation names (ObsNames) property.

Try the following fix:

%# I assume you have a 'dataset' object
ds = dataset(...);

%# clear the observation names property (it is simply a label for each record)
ds.Properties.ObsNames = [];
0
votes

Amro suggested clearing the observation names:

ds.Properties.ObsNames = [];

This results in a massive performance benefit: the time for the subset assignment changes from quadratic in the number of rows (a straight line when plotted against rows^2) to linear (a straight line when plotted against rows), at the minor cost of losing the ObsNames.
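If the labels are still needed afterwards, one option is to keep a copy of them next to the dataset before clearing the property. The following is only a minimal sketch under assumptions: the tiny dataset and the names savedNames, sub and subNames are made up for illustration, and the effect on a large dataset has not been measured here.

```matlab
% Small stand-in dataset (hypothetical data; requires the Statistics Toolbox)
ds = dataset({(1:5)', 'id'}, {{'a';'b';'c';'d';'e'}, 'label'}, ...
             'ObsNames', {'r1';'r2';'r3';'r4';'r5'});

savedNames = ds.Properties.ObsNames;  % keep a copy of the labels
ds.Properties.ObsNames = [];          % clear them before subsetting

m = 3;
sub = ds(1:m, :);                     % subset taken without ObsNames
subNames = savedNames(1:m);           % labels for the same rows, kept separately
```

The labels then travel alongside the subset instead of inside it, so any later row selection has to be applied to both.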

Copying a dataset is nearly instantaneous, so copying the whole dataset and then deleting the unneeded rows also gives a massive performance improvement. It is a slightly less optimal solution, but there is no loss of ObsNames. It is about 2x slower than dropping ObsNames, and improves by only about 2% when ObsNames are also dropped.


Supporting data

I used a small script to assign a subset of the rows of a 150,000 x 25 mixed string/integer/boolean dataset, which generated the following time measurements (in seconds).

The memory heap size made no significant difference in performance and was left at 128 MB.

  • subref means the standard function for subset assignment was used

  • ObsNames=[] means the ObsNames were dropped

  • Delete means the dataset was copied and the unneeded rows deleted

Rows      subref     subref & ObsNames=[]   Delete   Delete & ObsNames=[]
8000      4.19       2.06                   4.81     4.72
32000     57.61      2.49                   5.26     5.21
72000     390.72     3.21                   6.09     6.03
128000    ?(*)       4.21                   7.25     7.19

(*) I gave up on evaluating this value. Based on linear extrapolation against rows^2 I would guess 2000 sec, or half an hour.
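
An extrapolation of this kind could be sketched as follows, using the three measured subref timings from the table. This is only an illustration: the variable names are made up, and a straight-line fit through three points is a rough model, so its result need not agree with the guess above.

```matlab
% Measured timings for the plain subset assignment (subref column of the table)
rows = [8000 32000 72000];
secs = [4.19 57.61 390.72];

% Fit secs ~ a*rows^2 + b; rows are scaled to thousands to keep the fit well conditioned
x = (rows / 1000).^2;
p = polyfit(x, secs, 1);

% Extrapolate to the unmeasured 128,000-row case
guess = polyval(p, (128000 / 1000)^2);
fprintf('Estimated time for 128000 rows: %.0f s\n', guess);
```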


Script

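clear
load('data'); % load the 'alldata' dataset
% alldata.Properties.ObsNames = []; % uncomment to drop ObsNames

tic;
x = (1:4).^2 .* 8000; % subset sizes: 8000, 32000, 72000, 128000 rows
t = zeros(size(x));   % preallocate the timing results

for h = 1:length(x)
    start = toc;
    somedata = alldata(1:x(h),:);      % "subref" variant: direct subset
%     somedata = alldata;              % "Delete" variant: copy everything...
%     somedata(x(h)+1:end,:) = [];     % ...then drop the unrequired obs
    t(h) = toc - start;                % elapsed time for this subset size
    clear somedata
    disp([x(h), t(h)]);
end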