2
votes

I am importing a CSV file that is comma delimited into MATLAB. Each column has quotes around anything I want to consider as text and then a comma.

I am using read_mixed_csv function from the answer to this question to read in the data as a cell: Import CSV file with mixed data types

thisdata = read_mixed_csv(fname, ','); % Reads in the CSV file 
thisdata = regexprep(thisdata, '^"|"$','');

However, since a few of my columns look like this:

"FAIRHOPE, Alabama"
"FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA"
"Daphne-Fairhope-Foley, AL"

MATLAB places everything after a comma into a new column. So

"Daphne-Fairhope-Foley, AL"

Becomes two columns

"Daphne-Fairhope-Foley
AL"

How can I get MATLAB to read in a mixed csv file and not only consider a comma as a delimiter, but also consider the quotation marks? Is there a more automated way of doing it than textscan? If textscan is an option, what would that look like?

Here is a sample of the data I'm trying to read in with the header included:

"State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude","Datum","Parameter Name","Sample Duration","Pollutant Standard","Date Local","Units of Measure","Event Type","Observation Count","Observation Percent","Arithmetic Mean","1st Max Value","1st Max Hour","AQI","Method Name","Local Site Name","Address","State Name","County Name","City Name","CBSA Name","Date of Last Change"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-01","Micrograms/cubic meter (LC)","None",1,100.0,7.3,7.3,0,30,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-04","Micrograms/cubic meter (LC)","None",1,100.0,7.6,7.6,0,32,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-07","Micrograms/cubic meter (LC)","None",1,100.0,8.6,8.6,0,36,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-10","Micrograms/cubic meter (LC)","None",1,100.0,7,7,0,29,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"

*Note: Converting the CSV file to a tab delimited file makes it easier for MATLAB to deal with and circumvents this problem.

2

2 Answers

1
votes

Having a text qualifier (like ") is a little tricky, but the following might work if you ensure that each row of your table will have the same number of columns (and probably no empty ones).

Anything not within the text qualifier must be convertible to a number.

function C = csvmixed(eachLine,delim,textQualifier)
% Outputs cell containing mixed string and numeric data given a delimiter (',') 
% and a text qualifier ('"').  Each line of the delimited file must be loaded into 
% the cell array eachLine, and each line must have the same number of columns.
% 
% Example:
%   fid = fopen('testcsv.txt','r');
%   eachLine = textscan(fid,'%s','Delimiter','\n'); fclose(fid);
%   C = csvmixed(eachLine{1},',','"')

assert(ischar(delim) && numel(delim)==1);
assert(ischar(textQualifier) && numel(textQualifier)==1);

% find strings, as specified by the input qualifier
patternStr = sprintf('"([^"]*)"%c?',delim);
patternStr = strrep(patternStr,'"',textQualifier);
Cstr = regexp(eachLine,patternStr,'tokens');

% find numeric data
patternNum = sprintf('(?<=(,|^))[^%c,a-zA-Z]*(?=(,|$))',textQualifier);
patternNum = strrep(patternNum,',',delim);
Cnum = regexp(eachLine,patternNum,'match','emptymatch');

numCols = cellfun(@numel,Cstr) + cellfun(@numel,Cnum);
assert(nnz(diff(numCols))==0,'Number of columns not consistent.')

% get string extents (begin, start) indexes for each string
strExtents = regexp(eachLine,patternStr,'tokenExtents');

% deal out parsed data for each line
C = cell(numel(eachLine),numCols(1));
for ii = 1:numel(eachLine),
    strBounds = vertcat(strExtents{ii}{:});
    delimLocs = getDelimLocs(eachLine{ii},strBounds,delim);
    strCellMap = getCellMap(strBounds,delimLocs);

    C(ii,strCellMap) = [Cstr{ii}{:}]; % TODO: preallocate
    C(ii,~strCellMap) = num2cell(str2double(Cnum{ii})); % all else must be numeric
end

end

function delimLocs = getDelimLocs(lineText,solidBounds,delim)
    delimCharLocs = strfind(lineText,delim);
    delimLocs = delimCharLocs(~any(bsxfun(@ge,delimCharLocs,solidBounds(:,1)) & ...
        bsxfun(@le,delimCharLocs,solidBounds(:,2)),1));
end

function cellMap = getCellMap(typeBounds,delimLocs)
    cellMap = any(bsxfun(@gt,typeBounds(:,1),[0 delimLocs]) & ...
        bsxfun(@lt,typeBounds(:,1),[delimLocs Inf]), 1);
end

UPDATE: Fix small typos in getDelimLocs. Add preallocation of cell array.

0
votes

Use the file exchange code replaceinfile to replace the strings that have commas in them with a period instead. Use read_mixed_csv from Import CSV file with mixed data types to read in the file. Remove the extra quotes from the strings that are still left.

replaceinfile(', ', '. ', fname); % Replace commas that was inside quotes and not meant to be separated as periods so they don't show up as a new column
thisdata = read_mixed_csv(fname, ','); % Reads in the CSV file (\t for tab)
thisdata = regexprep(thisdata, '^"|"$',''); % Remove quotes from file and only keep the first 28 columns (last two columns are empty)

For replaceinfile.m function: For running the code on Linux, change the first line of the section on Perl to

perlCmd = sprintf('"%s"', '/usr/bin/perl');