4
votes

This is a general programming question, but there may be MATLAB-specific considerations.

I will be importing a very large data file. Is it better practice/faster/more efficient to import the whole file into memory and then divide it into submatrices, or to import every n columns into a new matrix?

My guess is that it would be faster to load it all into the cache and then deal with it, but it's just an uneducated guess.

What format is your data file in? Is it a text file, an ASCII-delimited numeric file, a CSV file? There are specialized handlers for several data types built into MATLAB that use syntax similar to C, in that you open a file stream, read from the file stream, and finally close the file stream. I think one of these would be your best bet. – Engineero
Could you specify what you will do with the matrix? If the goal is just to load the matrix into memory with no computation afterwards, I can't see any reason to exploit the cache. – Da Kuang
Hello, and thank you for the comment, Engineero. I am currently writing my code on the assumption that the data is a CSV. What I am doing is calling data = csvread('filename') and then dividing the data matrix into several matrices, say matrix_1_2 = data(:,1:2), etc. (see the sketch after these comments). Is that better than scanning for the first two columns only, saving them, then scanning for the second pair of columns, and so on? – msmf14
Da Kuang: There will be a lot of matrix manipulation and multiplication. I am guessing the most efficient way is not to divide the large matrix at all, but to use subsets of it in the calculations (for instance resultingMatrix = data(:,1:n) .* data(:,n+1:2*n)), although that will make the code less legible for others. – msmf14
Write the code in the simplest and most readable way possible. If the data comfortably fits in main memory, you'll be fine. If it doesn't, you're in a world of hurt, and you'll need a better disk storage format (binary) and blocked algorithms that operate on batches of the data. You certainly don't want to parse each line of the CSV more than once, which blocking by column would do. – Peter
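
For reference, here is a minimal sketch of the load-once-and-slice approach discussed in the comments above; the file name and the submatrix width n are hypothetical, not taken from the question:

    % Read the whole CSV once, then index submatrices out of the result.
    data = csvread('filename.csv');   % hypothetical file name
    n = 2;                            % assumed width of each submatrix
    matrix_1_2 = data(:, 1:n);        % first pair of columns
    matrix_3_4 = data(:, n+1:2*n);    % second pair of columns

    % Or operate on slices directly without naming the submatrices:
    resultingMatrix = data(:, 1:n) .* data(:, n+1:2*n);

Either way the file is parsed only once; slicing data afterwards happens in memory and is cheap compared to re-reading the file.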

1 Answer

1
votes

In my experience, the best approach is to parse the file once, using csvread (which calls dlmread, which in turn calls textscan, so the time penalty is not significant). This is, of course, assuming the file is not larger than the amount of free RAM you have. If the file is larger than RAM (I just had to parse a 31 GB file, for example), then I would use fopen, read line by line (or in chunks/blocks, whatever you prefer), and write the results to a writable MAT-file. This way you can, in theory, write huge files limited only by your file system.
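
As a rough illustration of the second case, here is a minimal sketch of reading a large CSV in blocks and appending them to a MAT-file; the file names, the column count nCols, and the block size chunkRows are assumptions, not values from the question:

    % Assumed layout: a purely numeric, comma-delimited file with nCols columns.
    nCols     = 10;                       % hypothetical column count
    chunkRows = 100000;                   % rows to read per block
    fmt       = repmat('%f', 1, nCols);   % textscan format for one row

    m = matfile('big.mat', 'Writable', true);   % hypothetical output file
    m.data = zeros(0, nCols);                   % variable to grow on disk

    fid = fopen('huge.csv', 'r');                % hypothetical input file
    rowsWritten = 0;
    while ~feof(fid)
        C = textscan(fid, fmt, chunkRows, 'Delimiter', ',');
        block = [C{:}];                          % cell of columns -> matrix
        if isempty(block), break; end
        m.data(rowsWritten+1 : rowsWritten+size(block,1), :) = block;
        rowsWritten = rowsWritten + size(block, 1);
    end
    fclose(fid);

Because matfile writes a Version 7.3 MAT-file, the data variable can grow beyond the available RAM; only one block is held in memory at a time.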