12
votes

I am in the middle of parsing in a large amount of csv data. The data is rather "dirty" in that I have inconsistent delimiters, spurious characters and format issues that cause problems for read_csv().

My problem here, however, is not the dirtiness of the data, but just trying to understand the parsing errors that read_csv() is giving me. If I can better understand the error messages, I can then do some janitorial work to fix the problem with scripts. The size of the data makes a manual approach intractable.

Here's a minimal example. Suppose I have a csv file like this:

"col_a","col_b","col_c"
"1","a quick","10"
"2","a quick "brown" fox","20"
"3","quick, brown fox","30"

Note that there's spurious quotes around "brown" in the 2nd row. This content goes into a file called "my_data.csv".

When I try to read that file, I get some parsing failures.

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

As you can see, the parsing failure has not been "pretty printed". It is ONE LONG LINE of 271 characters.

I can't figure out where to even put linebreaks in the failure message to see where the problem is and what the message is trying to tell me. Moreover, it refers to a "2x5 tibble". What tibble? My data frame is 3x3.

Can someone show me how to format or put linebreaks in the message from read_csv() so I can see how it is detecting the problem?

Yes, I know what the problem is in this particular minimal example. In my actual data I am dealing with large amounts of csv (~1M rows), peppered with inconsistencies that shower me with hundreds of parsing failures. I'd like to setup a workflow for categorizing these and dealing with them programmatically. The first step, I think, is just understanding how to "parse" the parsing failure message.

1
Try it with read.csv rather than read_csv. Your example gives a more sensible answer with the former, so it might work better on your full dataset.Andrew Gustar
If your file isn't a proper CSV file, then read_csv isn't going to be that helpful. Just like with code, the error messages you get doesn't always address what's really wrong. Maybe try a CSV validator.MrFlick
If this is specific to read_csv, then you should remove references to read.csv because that is confusing (they are two completely different functions). The latter of which is from the readr package which is part of the "tidyverse" in which tibbles replace data.frames. If you have imbalanced quotes, all the newlines are included in the last open quote so one value may take up many, many lines in your file. There's no way for R to know where to break that up.MrFlick
At worst you can use readLines to read each line as a whole, then you can try to detect lines with problems and clean them up.Gregor Thomas
Had similar problems a while back and I'd go with @Gregor. You could also try using "sed" which is a bash csv utility with data.table. See: stackoverflow.com/questions/22229109/…detroyejr

1 Answers

16
votes

After taking a breath and looking at the actual documentation, I see there is a way to get the parsing failures from read_csv() in a form that is very usable.

All I had to do to get the parsing failures was to use problems().

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

> parsing_failures <- problems(df)
> parsing_failures
# A tibble: 2 x 5
    row   col           expected actual            file
  <int> <chr>              <chr>  <chr>           <chr>
1     2 col_b delimiter or quote      b './my_data.csv'
2     2 col_b delimiter or quote        './my_data.csv'

Apparently read_csv() associates a tibble containing parsing failure details and this is accessible by passing the result from read_csv to problems().