7
votes

I am trying to read a dirty CSV file using the fread() function from the data.table package but have a problem with embedded double quotes and commas in the string values, that is, unescaped double quotes present in a quoted field. The following example data illustrates my problem. It consists of 3 lines/rows and 6 columns, with the first line containing the column names:

"SA","SU","CC","CN","POC","PAC"
"NE","R","000","H "B", O","1","8"
"A","A","000","P","E,5","8"

The first problem is in line 2 which has an embedded pair of double quotes and a comma: "H "B", O". The second problem is in line 3 which has a comma within the double quotes: "E,5". I have tried the following:

Attempt 1

library(data.table)
x1 <- fread(file = "example.csv", quote = "\"")

Output:

> x1
     V1 "SA" "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R"    0 "H "B"   O"   "1"     8
2:  "A"  "A"    0    "P"   "E    5"     8

Message:

Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect because it adds a new column V1.

Attempt 2

x2 <- fread(file = "example.csv", quote = "")

Output:

> x2
     V1 "SA"  "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R" "000" "H "B"   O"   "1"   "8"
2:  "A"  "A" "000"    "P"   "E    5"   "8"

Message:

Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect because it adds a new column V1..

Solution?

What I am looking for is a way to get an output similar to

> x3
   SA SU CC       CN POC PAC
1: NE  R  0 H 'B', O   1   8
2:  A  A  0        P E,5   8

preferably using fread(), but other suggestions are welcome.

1

1 Answers

4
votes

You could try cleaning your data beforehand and replace the double quotes with single quotes.

x = readLines('my_file.csv')
y = gsub('","', "','", x) # replace double quotes for each field
y = gsub('^"|"$', "'", y) # replace trailing and leading double quotes
z = paste(y, collapse='\n') # turn it back into a table for fread to read
df = fread(z, quote="'")
df

   SA SU CC       CN POC PAC
1: NE  R  0 H "B", O   1   8
2:  A  A  0        P E,5   8

I can't confirm that this is efficient since I don't know how big your file is, but it might be a worthwhile approach.