4 votes

I'm trying to read a large text file (a few million lines) into Matlab. Initially I was using importdata(file_name), which seemed like a concise solution. However, I need to use Matlab 7 (yeah, I know it's old) and it seems importdata isn't supported there. As such I tried the following:

fid = fopen(file_name, 'r');    % open the file for reading
fdata = {};                     % this cell array grows on every iteration
lno = 1;
while ~feof(fid)
    fline = fgetl(fid);         % read one line (newline stripped)
    fdata{1,lno} = fline;
    lno = lno + 1;
end
fclose(fid);

But this is really slow. I'm guessing it's because the cell array is being resized on each iteration. Is there a better way of doing this? Bear in mind that the first 20 lines of the input data are string-type data and the remainder of the data is 3 to 6 columns of hexadecimal values.


3 Answers

5 votes

You will have to do some reshaping, but another option for you would be fread. As was mentioned, though, that essentially locks you into a rectangular import. So another option would be to use textscan. As I mention in another note, I'm not 100% sure when it was implemented; all I know is you don't have importdata().

fid = fopen('textfile.txt', 'r');
Out = textscan(fid, '%s', 'delimiter', sprintf('\n'));   % one cell per line
fclose(fid);

With textscan you get a cell array containing one character string for each line, which you can then manipulate however you want. And as I say in my comments, it no longer matters whether the lines are the same length or not. Now you can parse the cell array much more quickly. But as gnovice mentions (and he does have a very elegant solution), you may have to concern yourself with memory requirements.
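For example, here is a minimal sketch of how you might split the result, assuming the first 20 lines are the string-type header described in the question (the variable names are illustrative):

lines = Out{1};               % Nx1 cell array, one string per line
headerLines = lines(1:20);    % the string-type header
dataLines = lines(21:end);    % the hexadecimal rows
firstRow = sscanf(dataLines{1}, '%x', [1 inf])   % e.g. parse a single line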

The one thing you never want to use in Matlab, if you can avoid it, is looping structures. They are fast in C/C++ etc., but in Matlab they are the slowest way of getting where you are going.

EDIT: Just looked it up, and it looks like textscan WAS implemented in version 7 (R14), so if that's what you have, you should be good to use that.

2 votes

I see two options:

  1. Rather than growing by 1 every single time, you could e.g. double the size of your array only when necessary; this massively reduces the number of reallocations required (see the sketch after this list).
  2. Do a two-pass approach. The first pass simply counts the number of lines, without storing them. The second pass actually fills in the array (which has been preallocated to the correct size).
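Here is a minimal sketch of the first option, assuming the lines end up in a cell array as in the question (the file name, initial capacity, and variable names are illustrative; the second option would simply count the lines in a first pass and call cell(1,nLines) up front):

fid = fopen('textfile.txt', 'r');
capacity = 1024;                    % initial guess, doubled whenever it runs out
fdata = cell(1, capacity);          % preallocate
lno = 0;
while ~feof(fid)
    lno = lno + 1;
    if lno > capacity               % out of room: double the allocation
        capacity = 2*capacity;
        fdata{1,capacity} = [];     % indexing past the end grows the cell array
    end
    fdata{1,lno} = fgetl(fid);
end
fclose(fid);
fdata = fdata(1,1:lno);             % trim the unused tail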
2 votes

One solution is to read the entire contents of the file as a string of characters with FSCANF, split the string into individual cells at the points where newline characters occur using MAT2CELL, remove extra white space on the ends with STRTRIM, then process the string data in each cell as needed. For example, using this sample text file 'junk.txt':

hi
hello
1 2 3
FF 00 FF
12 A6 22 20 20 20
FF FF FF

The following code will put each line in a cell of a cell array cellData:

>> fid = fopen('junk.txt','r');
>> strData = fscanf(fid,'%c');
>> fclose(fid);
>> nCharPerLine = diff([0 find(strData == char(10)) numel(strData)]);  %# Characters per line (including newlines)
>> cellData = strtrim(mat2cell(strData,1,nCharPerLine))

cellData = 

    'hi'    'hello'    '1 2 3'    'FF 00 FF'    '12 A6 22 20 20 20'    'FF FF FF'

Now if you want to convert all of the hexadecimal data (lines 3 through 6 in my sample data file) from strings to vectors of numbers, you can use CELLFUN and SSCANF like so:

>> cellData(3:end) = cellfun(@(s) {sscanf(s,'%x',[1 inf])},cellData(3:end));
>> cellData{3:end}    %# Display contents

ans =

     1     2     3

ans =

   255     0   255

ans =

    18   166    34    32    32    32

ans =

   255   255   255

NOTE: Since you are dealing with such large arrays, you will have to be mindful of the amount of memory being used by your variables. The above solution is vectorized, but may take up a lot of memory. You may have to overwrite or clear large variables like strData when you create cellData. Alternatively, you could loop over the elements in nCharPerLine and individually process each segment of the larger string strData into the vectors you need, which you can preallocate now that you know how many lines of data you have (i.e. nDataLines = numel(nCharPerLine)-nHeaderLines;).
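For instance, here is a minimal sketch of that looping alternative, assuming 20 header lines as in the question (nHeaderLines and numericData are illustrative names):

nHeaderLines = 20;
nDataLines = numel(nCharPerLine) - nHeaderLines;
numericData = cell(nDataLines,1);                  %# Preallocated to the known size
iStart = sum(nCharPerLine(1:nHeaderLines)) + 1;    %# First character after the header
for iLine = 1:nDataLines
  iEnd = iStart + nCharPerLine(nHeaderLines+iLine) - 1;
  numericData{iLine} = sscanf(strData(iStart:iEnd),'%x',[1 inf]);
  iStart = iEnd + 1;
end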