3
votes

I have a large csv file with 19 columns of character/numeric data.

Upon running fread, I got an error message saying one of my numeric columns was being converted to character because the field had value "". I then opened up my data in a text editor, and found the source of my problem. On one line, a character column read:

"""PARENTS"", ""Y.M."", AND ""EXPECTING"""

Which corresponded to the string:

"PARENTS", "Y.M.", AND "EXPECTING"

As:

  • The first quote is a string protector
  • The 2nd to 6th pairs of quotes are a single quote
  • The last quote is the close for the string protector.

From what I've seen before, fread would read this converting "" to \". The problem in this case is that the string also contains commas. These are being interpreted as delimiters, which messes with my column order and pushes later character columns into my numeric fields.

Is there a way to stop this, or should I use another package?

Note: I have looked around for a solution for this and sense that "" + fread is a source of frustration, but have not seen an example with the added complication of commas.

Reproduce:

Put the following in a txt file:

"A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S"
"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"

Read the data:

DT <- fread("myfile.csv",colClasses = c(rep("Character",5),
                                        rep("numeric",2),
                                        rep("character",12))
            ,sep = ",")
1
I remember having problems with this last year and it was because fread was under development, and looking at the page it says it still isRorschach
could be good to have update on this ... there is an issue with single quotes in fread as well ... unfortunately for R I am completely moving to pandas & python for this kind of issue which cannot be tolerate in the industry.Colonel Beauvel

1 Answers

1
votes

With the recent fixes to fread() in current devel, v1.9.5, this is what I get:

require(data.table) #v1.9.5+
fread('A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S
"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"')

#            A                                             B                        C
# 1: 168263291 Gruner & Jahr Printing and Publishing Company Parents Ym and Expecting
                                          D E F G        H         I
# 1: ""PARENTS"", ""Y.M."", AND ""EXPECTING"" 0 0 3 73130201 055302756
                    J         K  L         M         N            O     P
# 1: Quad/Graphics Inc. 013034588 02 093671063 000000000 Unclassified 94133
               Q          R             S
# 1: San Francisco CALIFORNIA UNITED STATES

fread() handles embedded quotes much more robustly, strips spaces by default (new strip.white argument, default=TRUE), and also gained encoding argument. Please see README on project page for up-to-date NEWS.

Let us know if your issue isn't still resolved (either here or on the project page) with a reproducible example.