I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple and clean csv files.
I’m iterating over multiple large CSV files that add up to millions of observations. Some columns contain many NA values.
When a column contains only NA for the first 1000 + x observations of a file, read_csv populates the entire column with NA, so the data is lost for further operations.
The warning message “Warning: x parsing failure” is shown, but since I’m reading many files I cannot check them one by one. Even if I could, I would not know an automated fix for the parsing problems that problems(x) reports.
Using read.csv instead of read_csv does not cause the problem, but it is slow and I run into encoding issues (using different encodings requires too much memory for large files).
One way to work around this bug is to add a first observation (first row) to the data that contains a value for every column, but to do that I would still need to read the file first somehow.
See a simplified example below:
library(readr)

## create a data frame
df <- data.frame(id = numeric(), string = character(),
                 stringsAsFactors = FALSE)
## populate the columns
df[1:1500, 1] <- seq_len(1500)
df[1500, 2] <- "something"
## the variable string contains its first value in obs. 1500
df[1500, ]
## check the number of NA in the variable string
sum(is.na(df$string)) # 1499
## write the df
write_csv(df, "df.csv")
## read the df with read_csv and read.csv
df_readr <- read_csv("df.csv")
df_read_standard <- read.csv("df.csv")
## check the number of NA in the variable string
sum(is.na(df_readr$string)) # 1500
sum(is.na(df_read_standard$string)) # 1499
## the data read with read_csv is all NA for the variable string
problems(df_readr) ## What should that tell me? How to fix it?
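On the question of what problems() reports: it returns a tibble with one row per parsing failure, so the check can be automated across many files instead of reading warnings by eye. The following is only a sketch, assuming readr is installed; the helper name read_checked and the temporary file are made up for illustration and are not part of readr.

```r
library(readr)

# Recreate the situation from the question: the only non-NA value in
# `string` sits beyond the first 1000 rows that readr inspects by default.
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(id = 1:1500, string = c(rep(NA_character_, 1499), "something")),
  path
)

# Hypothetical helper: read one file and stop if readr recorded any
# parsing failures, instead of silently returning NA-filled columns.
read_checked <- function(file, ...) {
  out <- suppressMessages(suppressWarnings(read_csv(file, ...)))
  probs <- problems(out)
  if (nrow(probs) > 0) {
    stop(sprintf("%d parsing failure(s) in %s", nrow(probs), file),
         call. = FALSE)
  }
  out
}

# With the default guessing the helper is expected to raise an error here;
# tryCatch turns that error into a message string for inspection.
result <- tryCatch(read_checked(path), error = function(e) conditionMessage(e))
print(result)
```

In a loop over many files, the same helper could be wrapped in tryCatch per file so that failing files are collected and re-read with explicit column types.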
data.table::fread('df.csv') too – Tung
guess_max (tip: =Inf) – Eric Lecoutre
The reason read_csv can be faster than read.csv is that it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max), but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know. – MrFlick
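A minimal sketch of the col_types and guess_max suggestions from the comments above, assuming readr is installed and reusing the column names id and string from the question's example. The temporary file only makes the snippet self-contained; it is not part of the original post.

```r
library(readr)

# Same shape as the question's data: `string` is NA for the first 1499 rows.
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(id = 1:1500, string = c(rep(NA_character_, 1499), "something")),
  path
)

# Option 1: declare the column types up front so nothing has to be guessed.
df_typed <- read_csv(
  path,
  col_types = cols(id = col_double(), string = col_character())
)

# Option 2: keep guessing, but let readr inspect every row.
# Type detection then scans the whole file, which costs time on large inputs.
df_guessed <- read_csv(path, guess_max = Inf)

# Count the NA values that survive each approach.
sum(is.na(df_typed$string))
sum(is.na(df_guessed$string))
```

With explicit types the value in row 1500 is kept, so the NA count should match the 1499 that read.csv gives in the question. For many large files, col_types is probably the better fit, since it avoids a full guessing pass over each file.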