
My question concerns two issues I encountered while reading a published .tsv file that contains campaign finance data.

First, the file has a null character that terminates input and throws the error embedded nul in string: 'NAVARRO b\0\023 POWERS' when using data.table::fread(). I understand that there are a number of potential solutions to this problem, but I was hoping to find something within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
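For reference, this is roughly the call that fails (a minimal sketch; the directory and file name are the ones used in the Python script in the answer below):

library(data.table)

setwd("C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")

# stops partway through the file with:
# Error: embedded nul in string: 'NAVARRO b\0\023 POWERS'
receipts <- fread("RCPT_CD.TSV", sep = "\t")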

That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = TRUE) does not throw an error, but it also does not detect the same number of rows that data.table::fread() identified (~100k rows with read.table() vs. ~8M rows with data.table::fread()). The fread() count seems more likely to be correct, since the file is ~1.5 GB and data.table::fread() reads valid data in the rows leading up to where the error seems to occur.
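For comparison, here is roughly the read.table() call (again a sketch; header = TRUE is an assumption, and the other settings are the ones mentioned above):

# finishes without an error, but returns only ~100k rows instead of ~8M
receipts_rt <- read.table("RCPT_CD.TSV", sep = "\t", header = TRUE,
                          comment.char = "", quote = "", fill = TRUE,
                          skipNul = TRUE)
nrow(receipts_rt)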

Here is a link to the code and output for the issue.

Any ideas on why read.table() is returning such different results? fread() works by guessing characteristics of the input file, but it doesn't seem to be guessing any exotic options that I didn't also set in read.table().

Thanks for your help!

NOTE: I do not know anything about the file in question other than its source and the information it contains. The file comes from the California Secretary of State, by the way. At any rate, it is too large to open in Excel or Notepad, so I haven't been able to examine it visually beyond looking at a handful of rows in R.

Use a better text editor to look at the file. I'm assuming that you use Windows, so I'd recommend Notepad++. – Roland
Added a data.table tag. Matt Dowle is probably traveling since he's going to talk tomorrow in San Francisco, but maybe one of the other data.table gurus can offer a hypothesis. – IRTFM
@Roland Unfortunately, Notepad++ crashed. There is another thread on the subject of better editors - it looks like I can pay for something else. – Taylor White
@TaylorWhite, could you please post a link to a zipped file? Can't download 1.5GB atm. – Arun
@Arun -- I'll upload a new file as soon as I get a chance. – Taylor White

1 Answer


I couldn't figure out an R way to deal with the issue, but I was able to use a Python script that relies on pandas:

import pandas as pd
import os

os.chdir(path = "C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")

# read the tab-delimited file in chunks of 500k rows, skipping malformed lines
receipts_chunked = pd.read_table("RCPT_CD.TSV", sep = "\t", error_bad_lines = False, low_memory = False, chunksize = 5e5)

# write each chunk out to its own numbered .csv file
chunk_num = 0
for chunk in receipts_chunked:
    chunk_num = chunk_num + 1
    file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
    print("Output file:", file_name)
    chunk.to_csv(file_name, sep = ",", index = False)

The problem with this route is that, with error_bad_lines = False, problem rows are simply skipped instead of raising an error. There are only a handful of error cases (out of ~8 million rows), but this is obviously still suboptimal.