I have a problem using the solution to this question:

Why the field separator character must be only one byte?

I have a file whose columns are delimited by ~~~, and of course read.table fails with the error invalid 'sep' value: must be one byte. I googled and found the above question, whose solution successfully reads the file into a character matrix.

However, I would now like to convert this character matrix into a data frame, with each column's type assigned automatically by the same rules read.table would have applied to the original file, e.g. dates and strings converted to factors.

You could use readLines() and then split each line using strsplit() on the ~~~ delimiter. But this would not necessarily format the data as you want it. - Tim Biegeleisen
This is exactly how the other solution works, but it creates a character matrix, which I am now struggling to convert. - Alex
Just call as.data.frame on it, and then cast the columns as you want. - Tim Biegeleisen
I would like to cast the columns automatically, per the rules used in read.table. - Alex
Why not write out the matrix as a "txt" document with a single-byte separator, and then read it in again with read.table? - Adam Quek

1 Answer

read.table has a helper function, utils::type.convert, whose help file states:

This is principally a helper function for read.table. Given a character vector, it attempts to convert it to logical, integer, numeric or complex, and failing that converts it to factor unless as.is = TRUE. The first type that can accept all the non-missing values is chosen.
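
As a rough illustration of that description, here is a minimal sketch on a few made-up vectors (the names x_int, x_num, x_lgl and x_date are hypothetical and not from the question). as.is is passed explicitly because its default differs between R versions. Note that date-like strings are not parsed as dates; they end up as factors or plain character.

```r
# Illustrative vectors (made up for this sketch, not from the original post)
x_int  <- c("5", "6", "NA")
x_num  <- c("153.234", "2.5")
x_lgl  <- c("TRUE", "FALSE")
x_date <- c("2015-03-22", "2015-03-23")

# as.is is given explicitly, since its default differs between R versions
class(type.convert(x_int,  as.is = FALSE))  # "integer" ("NA" becomes a missing value)
class(type.convert(x_num,  as.is = FALSE))  # "numeric"
class(type.convert(x_lgl,  as.is = FALSE))  # "logical"
class(type.convert(x_date, as.is = FALSE))  # "factor": dates are not parsed
class(type.convert(x_date, as.is = TRUE))   # "character"
```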

The bit in read.table that calls this function is:

  for (i in (1L:cols)[do]) {
    data[[i]] <- if (is.na(colClasses[i])) 
      type.convert(data[[i]], as.is = as.is[i], dec = dec, 
                   numerals = numerals, na.strings = character(0L))
  ...
  }

where the elided branches handle columns whose class was specified explicitly through colClasses in the call to read.table.

For my purposes the following is sufficient:

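# Split each line on the literal "~~~" delimiter and stack the pieces into a character matrix
df2 <- do.call(rbind, strsplit(readLines('test.txt'), '~~~', fixed = TRUE))

# Convert each column with type.convert(); as.is = FALSE turns strings into factors.
# It is passed explicitly because its default differs between R versions.
df2_processed <-
  setNames(
    as.data.frame(lapply(seq_len(ncol(df2)), function(i) {
      type.convert(df2[, i], as.is = FALSE)}), stringsAsFactors = FALSE),
    paste0('v', seq_len(ncol(df2))))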

where test.txt is the following text file:

2015-03-22~~~153.234~~~hello~~~5~~~6
2015-03-22~~~153.234~~~hello~~~5~~~6
2015-03-22~~~153.234~~~hello~~~5~~~6
2015-03-22~~~153.234~~~hello~~~5~~~6
2015-03-22~~~153.234~~~hello~~~5~~~6
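
An alternative, along the lines of the round trip suggested in the comments on the question, is to swap the multi-character delimiter for a single-byte one and let read.table do the conversion itself. The following is only a sketch under stated assumptions: no field contains a tab, quoting and comment characters are not needed, and the names path, lines and df3 are hypothetical. The sample file is recreated in a temporary location so the snippet runs on its own.

```r
# Recreate the sample file in a temporary location so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines(rep("2015-03-22~~~153.234~~~hello~~~5~~~6", 5), path)

# Replace the literal "~~~" delimiter with a tab (assumes no field contains a tab)
lines <- gsub("~~~", "\t", readLines(path), fixed = TRUE)

# read.table() then applies its usual per-column type conversion;
# stringsAsFactors = TRUE asks for strings (including the dates) as factors
df3 <- read.table(text = lines, sep = "\t", header = FALSE,
                  stringsAsFactors = TRUE, quote = "", comment.char = "")
str(df3)
```

This avoids reimplementing read.table's conversion loop, at the cost of having to pick a replacement separator that cannot occur in the data.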