0
votes

Notice that the R's base command read.csv works such that read.csv(file=fileName, dec=".", sep=",", header=T) while fread does not work in the following demo, having quoted separators inline. We concentrate now on the data.table and fread, read.csv too slow.


I have a csv file that has comma as a field value separators and dot as a decimal point. The file MyFile.csv has field names with commas and dots, such as "Product.Apple.Green,Purple" where double quotes try to separate dots and commas. However, the separation with double quotes does not work with the fread such that

Sys.setlocale('LC_NUMERIC', 'fi_FI.UTF-8')   
fread(file="MyFile.csv", sep=",", dec=".")

so for example the fields

`CustomerID, ProductID, Prod.Sub, "Prod.Sub,feature", A.B.C, "A,B,C,D"` 

is read, with dash - denoting field separation now, as

`CustomerID - ProductID - Prod.Sub - "Prod.Sub - feature" - A.B.C - "A - B - C - D"` 

where "Prod.Sub, feature" is wrongly read as two fields "Prod.Sub - feature" and "A,B,C,D" is wrongly read as "A - B - C - D".

How can I inline escape separators with data.table's fread?

1
Which version of data.table are you using? When I use fread('myfile.csv'), the data is read normally in the latest version (I made a csv-file with the field names you described). Maybe you can include the first line of the csv-file?Jaap
Can't replicate on v1.9.6. fread works as expected on dt= fread( 'CustomerID, ProductID, Prod.Sub, "Prod.Sub,feature", A.B.C, "A,B,C,D" 1,2,3,4,5,6' ) (note there is a new line between the header and data - doesn't show in comments)dww
@Jaap data.table is 1.10.0 version. R's sessionInfo() data here and locale is C/UTF-8/C/C/C/C, could that be a problem if fread uses the location parameters for the separators?hhh
I can't reproduce this with v1.10.0. Could you include the steps to reproduce this problem in your question?Jaap

1 Answers

1
votes

Cool thing about fread, it can use system commands (at least on Unix).

So make use of sed, for example:

fread("sed 's/,\ /| /g' MyFile.csv", sep = "|", dec = ".")

Replace | with any separator that might not conflict.

Note: In your text example, the delimiter is shown as , + (space). Without space the solution does not work!

Also: without space (only ,), the quoted fields get processed by fread correctly using data.table version 1.10.0