1
votes

I have a large (>1GB) CSV file I'm trying to read into a data frame in R.

The non-numeric fields are enclosed in double-quotes so that internal commas are not interpreted as delimiters. That's well and good. However, there are also sometimes unmatched double-quotes in an entry, like "2" Nails".

What is the best way to work around this? My current plan is to use a text processor like awk to relabel the quoting character from the double-quote " to a non-conflicting character like pipe |. My heuristic for finding quoting characters would be double-quotes next to a comma:

gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt 

This question is related, but the solution (argument in read.csv of quote="") is not viable for me because my file has non-delimiting commas enclosed in the quotation marks.

2
I don't see how changing all "\"" to "|" would be helpful. I think you might need a regex solution (probably implememted in awk) that will find instances of something like: ","\"[^,]{0,}\"[^,]{0,}\"," , i.e. three double-quotes between commas.IRTFM
@DWin See edit. If I changed the quoting double-quotes to pipes then I could set quote='|' in read.csv.Blue Magister
You could, of course, do that but you will have the same parsing difficulties that you are currently experiencing. Try it on a small test case. In your test case there would be two offending doble quotes, one next to the first dbl-quotes.IRTFM

2 Answers

2
votes

Your idea of looking for quotes next to a comma is probably the best thing you can do; you could however try to turn it around and have the regex escape all the quotes that are not next to a comma (or start/end of line):

Search for

(?<!^|,)"(?!,|$)

and replace all the matches with "".

R might not be the best tool for this because its regex engine doesn't have a multiline mode, but in Perl it would be a one-liner:

$subject =~ s/(?<!^|,)"(?!,|$)/""/mg;
0
votes

This would be a more foolproof variant of Tim's solution, in case non-boundary commas exist inside the cell:

(?<!,\s+)"(?!\s+,$)

I'm not sure if it would have any bugs though.