Reading CSV data where columns have numbers and characters

Question

I am trying to read a large csv data set with 7 million rows using the following code

histca <- data.table::fread("HISTO_CA.csv", header = TRUE)

Some of the columns have odd characters....see example below.

I get the following error code

In fread("HISTO_CA.csv", select = c(1, 237:248), sep = ";", nrows = 1e+06, : Bumped column 239 to type character on data row 198668, field contains '™™?'.

Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' t

How can I import the data and exclude the rows where this problem occurs

Can you provide an example of the data so we can have a look. Have you tried anything else (I'm thinking off the the top of my head the 'colClasses = "character" ' from read.csv) — sorearm
Looking at fread documentation and its colClasses argument, we read: fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss. So I guess no direct solution with fread. I you want to stick to data.table I would suggest trying to clean the file outside R (are you working on Unix?) — Eric Lecoutre
I would try to read in all columns as character (using colClasses), to clean the data in R and to convert the columns to numeric after cleansing. — Uwe

rosscova rosscova · Accepted Answer · 2016-07-14T10:54:16

As suggested in the comments, you could read in the data as character class, and work from there:

library( data.table )
histca <- fread("HISTO_CA.csv", header = TRUE, colClasses = "character" )

So you'll be bringing all of the text from the csv file into R without change. Then you can filter out any rows where, for example, Col2 won't convert to numeric class:

histca <- subset( histca, is.na( as.numeric( Col2 ) ) )

Alternatively, you can keep those rows, simply replacing the offending values with empty character strings, which will convert to numeric as NA values.

histca[ is.na( as.numeric( Col2 ) ), Col2 := "" ]

( note this will spit back a warning that "NAs introduced", but that's what you want )

Then convert the entire column to numeric:

histca[ , Col2 := as.numeric( Col2 ) ]

Which will now have some NA values.

Reading CSV data where columns have numbers and characters

1 Answers