
I was just working through a tremendous headache caused by read_csv replacing my data with NA while reading simple, clean csv files.

I’m iterating over multiple large csv files that add up to millions of observations. Some columns contain quite a few NAs.

When a csv contains nothing but NA in a certain column for the first 1000+ observations, read_csv fills the entire column with NA, and the data is lost for further operations.

The warning message “Warning: x parsing failures” is shown, but since I’m reading multiple files I cannot check the output file by file. I also don’t know of an automated fix for the parsing failures that problems(x) reports.

Using read.csv instead of read_csv avoids the problem, but it is slow and I run into encoding issues (handling different encodings requires too much memory for large files).

One way to work around this bug is to prepend a first row to the data that contains a value in every column, but that still requires reading the file somehow first. A rough sketch of the idea follows.
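This is only a sketch, assuming the two-column df.csv built in the example further down; the placeholder row ("0,placeholder") is made up and would have to match the real column types of whatever file you are reading:

    library(readr)

## splice a fully populated dummy row in right after the header so the
## type guesser sees a value in every column, then drop that row again
    lines <- readLines("df.csv")
    patched <- c(lines[1], "0,placeholder", lines[-1])
    tmp <- tempfile(fileext = ".csv")
    writeLines(patched, tmp)
    df_patched <- read_csv(tmp)[-1, ]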

See a simplified example below:

    library(readr)
## create a data frame
    df <- data.frame(id = numeric(), string = character(),
                     stringsAsFactors = FALSE)
## populate the columns
    df[1:1500, 1] <- 1:1500
    df[1500, 2] <- "something"
## variable string gets its first non-NA value in obs. 1500
    df[1500, ]
## check the number of NAs in variable string
    sum(is.na(df$string))   # 1499
## write the df
    write_csv(df, "df.csv")
## read the df with read_csv and with read.csv
    df_readr <- read_csv("df.csv")
    df_read_standard <- read.csv("df.csv")
## check the number of NAs in variable string
    sum(is.na(df_readr$string))          # 1500
    sum(is.na(df_read_standard$string))  # 1499
## the read_csv result is all NA for variable string
    problems(df_readr)  ## What should that tell me? How do I fix it?
You can try data.table::fread('df.csv') too – Tung
Have a look at the parameter guess_max (tip: = Inf) – Eric Lecoutre
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know. – MrFlick
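For reference, the data.table route from the first comment would look roughly like this against the df.csv from the question. fread samples rows from several points across the file when guessing column types, which is why the late value is usually picked up:

    library(data.table)

## fread guesses column types from a sample spread across the whole
## file, not just the first rows, so the "something" in row 1500 is seen
    df_dt <- fread("df.csv")
    sum(is.na(df_dt$string))  # 1499 expected, matching read.csv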

1 Answer


Thanks to MrFlick for answering my question in the comments:

The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.

Setting guess_max = Inf also overcomes the problem, but then the speed advantage of read_csv seems to be lost.
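Put concretely, here is a minimal sketch of both fixes against the df.csv from the question (the column names and types are the ones from the example above):

    library(readr)

## declare the column types up front so read_csv never has to guess
    df_fixed <- read_csv("df.csv",
                         col_types = cols(id = col_double(),
                                          string = col_character()))
    sum(is.na(df_fixed$string))  # 1499, matching read.csv

## alternatively, make the guesser scan every row (slower on big files)
    df_full_guess <- read_csv("df.csv", guess_max = Inf)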