I have a large (>1GB) CSV file I'm trying to read into a data frame in R.
The non-numeric fields are enclosed in double-quotes so that internal commas are not interpreted as delimiters. That's well and good. However, some entries also contain unmatched double-quotes, like "2" Nails".
What is the best way to work around this? My current plan is to use a text processor like awk to change the quoting character from the double-quote " to a non-conflicting character like the pipe |. My heuristic for finding quoting characters is to treat a double-quote as one only when it sits next to a comma or at the start or end of a line:
gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt
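To check the heuristic on a small case, here is a minimal sketch that applies the same three substitutions to a single made-up row. The sample row and the relabel function name are assumptions for illustration only, and plain awk is used in place of gawk, since gsub is available in both.

```shell
# Hypothetical sample row: the second field has an unmatched inner double-quote.
sample='"Widget, large","2" Nails",3.50'

# Relabel only the double-quotes that look like field delimiters.
relabel() {
  awk '{
    gsub(/^"|"$/, "|")   # quote at the very start or end of the line
    gsub(/,"/, ",|")     # opening quote right after a comma
    gsub(/",/, "|,")     # closing quote right before a comma
    print
  }'
}

out=$(printf '%s\n' "$sample" | relabel)
printf '%s\n' "$out"
# prints: |Widget, large|,|2" Nails|,3.50
```

The inner quote in 2" Nails survives because it is followed by a space rather than a comma. The heuristic would still misfire on a field whose text itself contains a quote directly beside a comma, and on any field that already contains a pipe, so it is worth searching the file for those patterns before converting.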
This question is related, but its solution (the argument quote="" in read.csv) is not viable for me because my file has non-delimiting commas enclosed in the quotation marks.
quote='|' in read.csv. – Blue Magister