I am in the middle of parsing in a large amount of csv data. The data is rather "dirty" in that I have inconsistent delimiters, spurious characters and format issues that cause problems for read_csv().
My problem here, however, is not the dirtiness of the data, but just trying to understand the parsing errors that read_csv() is giving me. If I can better understand the error messages, I can then do some janitorial work to fix the problem with scripts. The size of the data makes a manual approach intractable.
Here's a minimal example. Suppose I have a csv file like this:
"col_a","col_b","col_c"
"1","a quick","10"
"2","a quick "brown" fox","20"
"3","quick, brown fox","30"
Note that there's spurious quotes around "brown" in the 2nd row. This content goes into a file called "my_data.csv".
When I try to read that file, I get some parsing failures.
> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 2 col_b delimiter or quote b './my_data.csv' file 2 2 col_b delimiter or quote './my_data.csv'
As you can see, the parsing failure has not been "pretty printed". It is ONE LONG LINE of 271 characters.
I can't figure out where to even put linebreaks in the failure message to see where the problem is and what the message is trying to tell me. Moreover, it refers to a "2x5 tibble". What tibble? My data frame is 3x3.
Can someone show me how to format or put linebreaks in the message from read_csv() so I can see how it is detecting the problem?
Yes, I know what the problem is in this particular minimal example. In my actual data I am dealing with large amounts of csv (~1M rows), peppered with inconsistencies that shower me with hundreds of parsing failures. I'd like to setup a workflow for categorizing these and dealing with them programmatically. The first step, I think, is just understanding how to "parse" the parsing failure message.
read.csv
rather thanread_csv
. Your example gives a more sensible answer with the former, so it might work better on your full dataset. – Andrew Gustarread_csv
isn't going to be that helpful. Just like with code, the error messages you get doesn't always address what's really wrong. Maybe try a CSV validator. – MrFlickread_csv
, then you should remove references toread.csv
because that is confusing (they are two completely different functions). The latter of which is from thereadr
package which is part of the "tidyverse" in which tibbles replace data.frames. If you have imbalanced quotes, all the newlines are included in the last open quote so one value may take up many, many lines in your file. There's no way for R to know where to break that up. – MrFlickreadLines
to read each line as a whole, then you can try to detect lines with problems and clean them up. – Gregor Thomas