MATLAB: textscan to parse irregular text, trouble debugging format specifier

Question

I've been browsing Stack Overflow and the MathWorks website trying to find a way to read an irregularly formatted text file into MATLAB using textscan, but I have yet to find a good solution.

The format of the text file looks like this:

// Reference=MNI

// Citation=Beauregard M, 1998

// Condition=Primed - Unprimed Semantic Category Decision

// Domain=Semantics

// Modality=Visual

// Subjects=13

-55 -25 -23

33 -9 -20

// Citation=Beauregard M, 1998

// Condition=Unprimed Semantic Category Decision - Baseline

// Domain=Semantics

// Modality=Visual

// Subjects=13

0 -73 9

-25 -59 47

0 -14 59

8 -18 63

-21 -90 -11

-24 -4 62

24 -93 -6

-21 15 47

-35 -26 -21

9 13 44

// Citation=Binder J R, 1996

// Condition=Words > Tones - Passive

// Domain=Language Perception

// Modality=Auditory

// Subjects=12

-58.73 -12.05 -4.61

I would like to end up with a cell array that looks like this: {nx3 double} {nx1 cellstr} {nx1 cellstr} {nx1 cellstr} {nx1 cellstr} {nx1 double}

where the first element holds the 3D coordinates, the second the citation, the third the condition, the fourth the domain, the fifth the modality and the sixth the number of subjects.

I would then like to use this cell array to organize the data into a structure that allows easy indexing of the coordinates by each of the features I extracted from the text file.

I've tried a bunch of things, but have only been able to extract the coordinates as strings and the features as a single cell array.

Here is how far I have gotten after searching through Stack Overflow and the MathWorks website:

fid = fopen(fullfile(path2proj,path2loc),'r');
data = textscan(fid,'%s %s %s','HeaderLines',1,...
    'delimiter',{...
        sprintf('// '),...
        'Citation=',...
        'Condition=',...
        'Domain=',...
        'Modality=',...
        'Subjects='});

I get the following output with this code:

data =

{16470x1 cell}    {16470x1 cell}    {16470x1 cell}

data{1}(1:20)

ans =

''
''
''
''
''
'-55  -25 -23'
'33   -9  -20'
''
''
''
''
''
'0    -73 9'
'-25  -59 47'
'0    -14 59'
'8    -18 63'
'-21  -90 -11'
'-24  -4  62'
'24   -93 -6'
'-21  15  47'

data{2}(1:20)

ans =

''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''

data{3}(1:20)

ans =

'Beauregard M, 1998'
'Primed - Unprimed Semantic Category Decision'
'Semantics'
'Visual'
'13'
''
''
'Beauregard M, 1998'
'Unprimed Semantic Category Decision - Baseline'
'Semantics'
'Visual'
'13'
''
''
''
''
''
''
''
''

Although I can work with the data in this format, it would be nice to understand how to correctly write a format specifier that extracts each piece of data into its own cell array. Does anyone have any ideas?

What is Reference=MNI? Does it repeat in a file, or is it only in the first line of the input file? — Marcin

Marcin · Accepted Answer · 2014-06-25T01:31:25

Assuming that Reference appears only in the first line, you could do the following to obtain the values you want from each Citation section.

% read the file and split it into sections, one per citation
filecontents = strsplit(fileread('data.txt'), '// Citation=');


% iterate through the sections and extract the desired info from each.
% We start from i=2 because filecontents{1} holds the 'Reference' line.
for i = 2:numel(filecontents)

    % split into lines; the optional \r handles Windows line endings
    lines = strtrim(regexp(filecontents{i}, '\r?\n', 'split'));

    % remove empty lines
    lines(cellfun(@isempty, lines)) = [];

    % get values of the fields
    citation  = lines{1};   % the '// Citation=' prefix was consumed by strsplit
    condition = get_value(lines{2}, 'Condition');
    domain    = get_value(lines{3}, 'Domain');
    modality  = get_value(lines{4}, 'Modality');
    subjects  = str2double(get_value(lines{5}, 'Subjects'));

    % convert the remaining lines into an n-by-3 numeric matrix
    coordinates = cell2mat(cellfun(@(s) sscanf(s, '%f')', lines(6:end), ...
                                   'UniformOutput', false)');

    % now you can save in some global cell,
    % display or process the extracted values as you please.

end

where get_value is:

function value = get_value(line, search_for)
    % return the text that follows 'search_for=' as a char vector
    tokens = regexp(line, [search_for, '=(.+)'], 'tokens', 'once');
    value = strtrim(tokens{1});
end
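
To organize the results for indexing by feature, as the question asks, one option is to collect each section into a struct array. The following is only a minimal sketch under some assumptions: the sample text is built in memory instead of read from a file, every section has its five header lines in the same order, and the names studies, strip and semanticCoords are made up for illustration.

```matlab
% Small in-memory sample in the same layout as the file (assumed format).
txt = sprintf(['// Reference=MNI\n' ...
    '// Citation=Beauregard M, 1998\n' ...
    '// Condition=Primed - Unprimed Semantic Category Decision\n' ...
    '// Domain=Semantics\n' ...
    '// Modality=Visual\n' ...
    '// Subjects=13\n' ...
    '-55 -25 -23\n' ...
    '33 -9 -20\n' ...
    '// Citation=Binder J R, 1996\n' ...
    '// Condition=Words > Tones - Passive\n' ...
    '// Domain=Language Perception\n' ...
    '// Modality=Auditory\n' ...
    '// Subjects=12\n' ...
    '-58.73 -12.05 -4.61\n']);

% Hypothetical helper: drop the leading '// Key=' prefix from a header line.
strip = @(row) regexprep(row, '^//\s*\w+=', '');

% Empty struct array with one field per feature.
studies = struct('citation', {}, 'condition', {}, 'domain', {}, ...
                 'modality', {}, 'subjects', {}, 'coordinates', {});

sections = strsplit(txt, '// Citation=');
for k = 2:numel(sections)              % sections{1} is the Reference line
    rows = strtrim(regexp(sections{k}, '\r?\n', 'split'));
    rows(cellfun(@isempty, rows)) = [];

    s = struct();
    s.citation    = rows{1};
    s.condition   = strip(rows{2});
    s.domain      = strip(rows{3});
    s.modality    = strip(rows{4});
    s.subjects    = str2double(strip(rows{5}));
    % each remaining row is one 'x y z' triple; stack them into n-by-3
    s.coordinates = cell2mat(cellfun(@(r) sscanf(r, '%f')', rows(6:end), ...
                                     'UniformOutput', false)');
    studies(end+1) = s;                %#ok<SAGROW>
end

% Index by feature: all coordinates from studies in the 'Semantics' domain.
isSemantic     = strcmp({studies.domain}, 'Semantics');
semanticCoords = vertcat(studies(isSemantic).coordinates);
```

The same logical-index pattern should work for the other fields, for example strcmp({studies.modality}, 'Auditory') or [studies.subjects] > 12.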

Hope this helps.
