1
votes

I have a very big and sparse matrix, represented as a CSV file (67 GB).

Is it possible to load and work with this matrix in MATLAB? I can use a 64-bit version on a Mac OS computer with 8 GB of RAM.

I have read a few posts about this topic, but I am still not sure whether 64-bit MATLAB on Mac OS can use disk space to allocate the matrix or needs to keep everything in RAM, and, in any case, whether using such a large portion of disk space would make things almost unusable.

2
Does the 67 GB include all of the zeros? How many non-zero elements are there? What are the overall dimensions of the matrix? Somewhere in the 100,000 x 100,000 range? – Pursuit
Yes, 67 GB is the size of the CSV file including the empty elements; to save space I removed the zeros, so 4,0,0,7,8 becomes 4,,,7,8. The size in terms of rows x columns is 230k x 290k. Less than 1/1000 of the matrix (0.1%) is non-empty. Cell values are integers, usually < 1000. – Eugenio

2 Answers

1
votes

It sounds like memory mapping is the solution for you!

http://www.mathworks.nl/help/matlab/memory-mapping.html

In essence, you map the contents of a file so you can access it in parts (a kind of indexing, but on your hard drive). Once that is done, depending on the sparsity of the matrix, you might want to switch to a sparse matrix that hopefully fits in your RAM, so you can use the speed of RAM and are no longer limited to hard-drive speeds.
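
For illustration, here is a minimal sketch of what such a mapping could look like with memmapfile (the file name is just a placeholder, and keep in mind that memmapfile gives you raw bytes, so for a CSV you would still have to convert the mapped characters to numbers yourself):

% Map the huge CSV file as raw bytes; only the parts you index are read from disk.
mm = memmapfile('hugematrix.csv', 'Format', 'uint8');
% Peek at the first kilobyte of the file as text.
first_chunk = char(mm.Data(1:1024))';
disp(first_chunk)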

Another solution would be to read the file line by line (or in other delimited chunks) and put only the non-zero values in a sparse matrix.

http://www.mathworks.nl/help/matlab/ref/fgetl.html

http://www.mathworks.nl/help/matlab/ref/sparse.html

Kind regards,

Ernst Jan

OK, so when using the fgetl solution I get reasonable performance: approximately 10 s per 100 lines on my laptop.

% Start with a clean slate.
clear all
% Create a data file, large!
m = 100; % Rows
n = 230000; % Columns
max_x = 10000;
X=randi(max_x,m,n); 
% Create lots of zeros by setting everything smaller than 0.999*max_x to 0
X(X<0.999*max_x)=0;
% Write data file
csvwrite('csvlist.dat',X);

% Now create a sparse matrix to put the csv file in:
P = sparse(m,n);

% Open data file
FID=fopen('csvlist.dat','r');
% Set line number counter to 0
line_number = 0;
% Get the first line of the data file (230K numbers)
text_line = fgetl(FID);
% Keep looping as long as fgetl returns a line of text (it returns -1 at end of file)
tic
while ischar(text_line)
    % Increase line_number by 1 (MATLAB indexing starts at 1)
    line_number = line_number + 1;
    % Parse the text line (I assume all integers; otherwise change the format %d to %f)
    C = textscan(text_line, '%d', 'delimiter', ',', 'EmptyValue', 0);
    % The numbers are now stored in cell array C; put them into the sparse matrix:
    P(line_number,:) = C{1}; % Row-wise assignment can be optimized, but it is fast enough for now
    % Get the next line
    text_line = fgetl(FID);
end
toc
fclose(FID);

So 230k lines should take about 5 to 10 hours.
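
If that turns out to be too slow, one way to avoid the row-wise assignment into the sparse matrix is to collect only the non-zero entries as (row, column, value) triplets while reading and build the matrix with a single call to sparse at the end. A sketch, reusing m, n and 'csvlist.dat' from the script above; the 0.1% non-zero estimate comes from the comments on the question:

% Preallocate triplet arrays for an estimated number of non-zeros (~0.1% of the matrix).
est_nnz = ceil(0.001 * m * n);
rows = zeros(est_nnz, 1);
cols = zeros(est_nnz, 1);
vals = zeros(est_nnz, 1);
count = 0;

FID = fopen('csvlist.dat', 'r');
line_number = 0;
text_line = fgetl(FID);
while ischar(text_line)
    line_number = line_number + 1;
    C = textscan(text_line, '%d', 'delimiter', ',', 'EmptyValue', 0);
    v = double(C{1});            % sparse matrices store doubles
    nz = find(v ~= 0);           % indices of the non-zero columns in this row
    idx = count+1 : count+numel(nz);
    rows(idx) = line_number;
    cols(idx) = nz;
    vals(idx) = v(nz);
    count = count + numel(nz);
    text_line = fgetl(FID);
end
fclose(FID);

% Build the sparse matrix in one call from the collected triplets.
P = sparse(rows(1:count), cols(1:count), vals(1:count), line_number, n);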

Kind regards,

Ernst Jan

0
votes

Based on this link I would say that MATLAB can definitely hold the amount of data stored in the matrix on most computers. (The matrix you describe should even fit into RAM, I guess.) To find the limits of your computer, use the memory command.
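
As a rough back-of-the-envelope estimate (my own calculation, using the ~0.1% density mentioned in the comments), a sparse double matrix costs roughly 16 bytes per non-zero plus about 8 bytes per column on a 64-bit system:

% Rough estimate of the memory needed by the sparse matrix (double precision):
m = 230000;                 % rows
n = 290000;                 % columns
density = 0.001;            % ~0.1% non-zero, from the comments
nnz_est = m * n * density;  % ~66.7 million non-zeros
% Each non-zero costs ~16 bytes (8-byte value + 8-byte row index),
% plus ~8 bytes per column for the column pointers.
bytes = nnz_est * 16 + n * 8;
fprintf('Estimated size: %.1f GB\n', bytes / 1e9);   % roughly 1.1 GB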

That being said, the hard part in creating this matrix is reading the file. As mentioned in the answer by @EJG89, you may need to use a line-by-line approach with fgetl, as I don't expect higher-level commands like dlmread to handle such a huge file in a reasonable way.

If all else fails, just find a way to split the file before processing.
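
Instead of physically splitting the file, one option (just a sketch; the file name and block size are placeholders) is to read and process it in blocks of lines with textscan on an open file handle:

% Process the CSV in blocks of lines so the whole file never has to be split on disk.
FID = fopen('hugematrix.csv', 'r');
block_size = 1000;                              % number of lines per block (tune to taste)
while ~feof(FID)
    block = textscan(FID, '%s', block_size, 'Delimiter', '\n');
    lines = block{1};                           % cell array with up to block_size lines
    % ... parse each line and store its non-zero entries in a sparse matrix ...
end
fclose(FID);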