2
votes

I am writing strings with unpredictable character content into a table with a fixed, known number of columns, and I am having trouble choosing a suitable separator.

For instance, a sample table might look like:

FILENAME: foo.txt

SEPARATOR: "\u00AA"

ROW1,COL1: foo

ROW1,COL2: b,ar

ROW1,COL3: fo;obar

ROW1,COL4: bo\tt

And so on.

In R, I would call

read.table('foo.txt', sep="\u00AA")

and get

invalid 'sep' value: must be one byte

What separator should I use to avoid conflicts with the unpredictable strings? Code points up to \u007F are accepted as sep, but R treats anything higher as multi-byte. Why?

2
Why not use something normal like , and a quote character like ", after escaping all instances of " in your strings? The command-line tool sed is super handy for this kind of thing. - Justin
I am going for efficiency. I prefer not to put the strings of interest in quotes, but that is an option to keep in mind. - bfb
The crux of my frustration is that I am writing and reading the table in R and also reading it in Python. A tab-delimited file works great to write in R and read in Python, but R cannot read it back. It returns "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 72373 did not have 11 elements" - bfb
R can read tab-separated values perfectly well (see ?read.table). That error may be caused by some other malformation in the data. You can inspect that line in the shell using sed -n 72373p filename.txt. - asb
There are certainly 11 elements in line 72373 via visual inspection. Could R be seeing a space instead of a tab? - bfb

2 Answers

2
votes

Figured it out. Thank you for the inspiration.

The key is to set comment.char="" and quote=""

For instance,

read.table('foo', sep="\t", quote="", comment.char="")

returns the proper data.frame.
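As a minimal sketch of why these two arguments matter: the file below is hypothetical (the field contents are made up), but it contains an apostrophe and a # inside fields. With the defaults, read.table treats ' and " as quote characters and # as the start of a comment, so such characters can change the number of fields R sees on a line.

```r
# Hypothetical tab-delimited file; the field contents are made up for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "foo\tb,ar\tfo;obar\tit's",
  "baz\tq#ux\tquux\tplain"
), path)

# With the default comment.char = "#", everything after '#' on the second
# line is dropped, so that line no longer has four fields.
count.fields(path, sep = "\t", quote = "")

# With quoting and comments disabled, both lines have four fields.
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Read the file with the same settings; fields are kept verbatim.
df <- read.table(path, sep = "\t", quote = "", comment.char = "",
                 stringsAsFactors = FALSE)
print(df)
```

The same two arguments are accepted by read.delim, so the call can be shortened there if preferred.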

0
votes

The way to debug input problems is to first run table(count.fields('file.nam')) to see which field counts occur, then oddities <- which(count.fields('file.nam') %in% odd_counts), where odd_counts holds the unexpected counts from that table, and then look at either readLines('file.nam')[oddities] or use sed to view the offending lines. Often the problem is the comment character, which by default is "#", and the solution in those cases is to use comment.char="" in the read.delim(.) call.
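The steps above can be sketched as follows. This is only an illustration under assumptions: the temporary file, its contents, and the expected column count of 3 are all made up, and count.fields is called with a tab separator and quote="" so that only the comment character is in play.

```r
# Hypothetical three-column, tab-delimited file; line 2 contains a '#'.
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "d\te#f\tg", "h\ti\tj"), path)

expected <- 3L  # assumed number of columns for this made-up file

# Count fields per line using the default comment.char = "#".
n_fields <- count.fields(path, sep = "\t", quote = "")

# Tabulate the counts to spot unexpected values.
print(table(n_fields, useNA = "ifany"))

# Indices of lines whose field count is missing or not the expected one.
oddities <- which(is.na(n_fields) | n_fields != expected)

# Show the raw text of the offending lines.
print(readLines(path)[oddities])

# Re-check with comments disabled, mirroring comment.char = "" in read.delim.
n_fixed <- count.fields(path, sep = "\t", quote = "", comment.char = "")
```

Comparing against the expected count directly avoids having to build odd_counts by hand, though either approach identifies the same lines.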