I am trying to read a rather big file with the read.table.ffdf method from the library ff. Unfortunately, the column-names of this table contain whitespaces, tabs and other special characters. It looks roughly like this (but with ~400 columns):

attribute_1;next attribute;who creates, these horrible) column&nämes
198705;RXBR ;2017-07-05 00:00:00

This isn't pretty, I know, but I am forced to work with it, so I have to set check.names to FALSE.

Furthermore, I build a named character vector of column classes like this:

path <- 'path_to_csv-file'
headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2, check.names = FALSE)
#print(headset)
headclasses <- vector(mode = 'character', length = 0)


# heavily simplified version - the switch statement lives in a separate function
for(i in colnames(headset)){
  headclasses[[i]] <- switch (i,
                              'attribute_1' = 'numeric',
                              'next attribute' = 'factor',
                              'who creates, these horrible) column&nämes' = 'POSIXct'
                              )
}
#print(colnames(headset))
#print(headclasses)

Now, if I call:

df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 1e4, VERBOSE = TRUE)

I get the following error:

Error in repnam(colClasses, colnames(x), default = NA) : the following argument names do not match'next attribute','(who creates, these horrible column&nämes)'

Why do I get this error? And how can I fix it so that I have the uglier strings as column names?

Note that check.names is set to FALSE in the call above.

My work so far:

1. Trying with proper names but wrong check.names option when calling read.table.ffdf

If I let R choose syntactically valid column names (i.e. check.names = TRUE, the default, in the first call to read.csv) and adjust the switch statement accordingly, I get no error at all (only a warning), even though I set check.names = FALSE in the call to read.table.ffdf:

headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2)
print(headset)
headclasses <- vector(mode = 'character', length = 0)


# heavily simplified version - the switch statement lives in a separate function
for(i in colnames(headset)){
  headclasses[[i]] <- switch (i,
                              'attribute_1' = 'numeric',
                              'next.attribute' = 'factor',
                              'who.creates..these.horrible..column.nämes' = 'POSIXct'
                              )
}
print(colnames(headset))
print(headclasses)

my_df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2, VERBOSE = TRUE)
print(my_df)
print(colnames(my_df))

"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"

Warning message: In read.table(na.strings = c("\N", ""), sep = ";", dec = ".", colClasses = list( : not all columns named in 'colClasses' exist

So this works when it shouldn't? Leaving out check.names in the call to read.table.ffdf behaves the same way, so the option seems to get lost somewhere along the way.

2. Checking the source code of read.table.ffdf

I went to the rdrr.io site (read.table.ffdf-source-code) to check the source code and tried to understand what I am doing wrong. To cut it short, this is what happens to my file:

rt.args <- list(na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2)
rt.args$file <- path
asffdf_args <- list()

FUN <- 'read.table'
dat <- do.call(FUN, rt.args)
x <- do.call("as.ffdf", c(list(dat), asffdf_args))
#print(colnames(dat))
#print(colnames(x))

and this yields

"attribute_1" "next attribute" "who creates, these horrible) column&nämes"

"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"

Ok, so this is where it goes wrong.

I don't know which asffdf_args to pass, and since I am fairly new to R, I am not sure what to look for other than some kind of check.names equivalent. I already had a look at the as.ffdf.data.frame method via

getAnywhere(as.ffdf.data.frame)

but that didn't help me understand what I should pass. So, how can I make read.table.ffdf work with the uglier column names? Which asffdf_args do I have to pass to make check.names = FALSE take effect in that method?

I could adapt my switch-statement (for roughly 400 columns), read the file with check.names = TRUE and after read.table.ffdf is done, I could set the column names to the desired ones (since I have to work with the nastier names later on). But this classifies as a workaround for me and does not satisfy me at all.

This is my first question here, so please be gentle with me if I am overlooking something major, and feel free to push me in the right direction.

Thanks in advance for the help.

Comments:

I think the first thing to fix is that colClasses should be a named character vector and not a list. - RolandASc
If I instantiate headclasses as vector(mode = 'character', length = 0), I still get the same error. - changelevel
I have edited my snippets to accommodate this change. - changelevel

1 Answer

As is, you probably cannot pass arguments the way you would like to.

as.ffdf.data.frame() calls ffdf() on its last line.
ffdf() in turn calls make.names() a few times, without consulting any argument that would let you switch this off.
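
To see what make.names() does to the names from the question, here is a small base-R illustration (no ff needed). The variable names are mine, and the exact result for the name containing ä can depend on the locale, so only the simpler cases are commented:

```r
# Header names from the question; "\u00e4" is the escaped form of "ä".
orig_names <- c(
  "attribute_1",
  "next attribute",
  "who creates, these horrible) column&n\u00e4mes"
)

# This is the kind of sanitizing that happens inside ffdf().
safe_names <- make.names(orig_names, unique = TRUE)

safe_names[1]  # "attribute_1" is already syntactically valid and stays as is
safe_names[2]  # the blank is replaced by a dot: "next.attribute"
print(safe_names)
```

This matches the second set of column names printed in the question, which is why a colClasses vector keyed by the original names no longer matches.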

If you edit ffdf(), and comment out the line vnam <- make.names(vnam, unique = TRUE) towards the very end of the function, then as.ffdf.data.frame() will be able to retain your funky column names.
I am not providing the modified version of ffdf as the function is more than 300 lines long.
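
One possible way to build such a modified copy without editing the function by hand is to deparse it, drop the offending line, and parse the rest again. The sketch below only demonstrates the idea on a small stand-in function (toy_ffdf is hypothetical, not part of ff). Whether the same text match hits exactly one line in the real ffdf() depends on the installed ff version, so that needs to be checked first.

```r
# Hypothetical stand-in for ff::ffdf(); only the make.names() line matters here.
toy_ffdf <- function(vnam) {
  vnam <- make.names(vnam, unique = TRUE)
  vnam
}

# Turn the function into lines of source text and find the line to drop.
src  <- deparse(toy_ffdf)
drop <- grepl("vnam <- make.names(vnam, unique = TRUE)", src, fixed = TRUE)
stopifnot(sum(drop) == 1)  # refuse to continue unless exactly one line matches

# Rebuild the function without that line and keep the original environment,
# so that it still finds the same helper functions as the original.
toy_ffdf_new <- eval(parse(text = src[!drop]))
environment(toy_ffdf_new) <- environment(toy_ffdf)

toy_ffdf_new("next attribute")  # the name is no longer sanitized
```

For the real function, the starting point would be deparse(ff::ffdf), and the result would play the role of ffdf_new.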

I have tested with a new function ffdf_new, injecting it as follows:

# save original version
orig <- ff::ffdf

# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("ffdf", ffdf_new)

# simple test below
DF <- data.frame(
  'attribute_1' = 1:10,
  'next attribute' = 3:12,
  'who creates, these horrible) column&nämes' = 11:20,
  check.names = FALSE
)

as.ffdf.data.frame(DF)[["who creates, these horrible) column&nämes"]]
## ff (open) integer length=10 (10)
##  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] 
##   11   12   13   14   15   16   17   18   19   20 

# switch back
godmode:::assignAnywhere("ffdf", orig)
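
If patching ffdf() is not an option, the workaround mentioned in the question does not require rewriting the switch statement for 400 columns. A minimal sketch, assuming the classes are available keyed by the original names (classes_by_orig, headclasses and name_map are illustrative names of mine): derive the sanitized names with make.names() and keep a lookup back to the originals.

```r
# Original header names, as they appear in the file; "\u00e4" is "ä".
orig_names <- c(
  "attribute_1",
  "next attribute",
  "who creates, these horrible) column&n\u00e4mes"
)

# Classes keyed by the ORIGINAL names, e.g. the output of the existing switch.
classes_by_orig <- c("numeric", "factor", "POSIXct")
names(classes_by_orig) <- orig_names

# Sanitized names, in the same order as the original ones.
safe_names <- make.names(orig_names, unique = TRUE)

# colClasses vector keyed by the sanitized names.
headclasses <- setNames(unname(classes_by_orig), safe_names)

# Lookup table: sanitized name -> original name.
name_map <- setNames(orig_names, safe_names)

name_map[["next.attribute"]]  # "next attribute"
```

headclasses could then be passed as colClasses with check.names left at its default, and name_map translates back to the original names wherever they are needed later. I have not verified this against every ff version, so treat it as a starting point.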