I am trying to read a fairly large file with read.table.ffdf from the ff package. Unfortunately, the column names of this table contain whitespace, tabs and other special characters. It looks roughly like this (but with ~400 columns):
attribute_1;next attribute;who creates, these horrible) column&nämes
198705;RXBR ;2017-07-05 00:00:00
This isn't pretty, I know, but I am forced to work with it, so I have to set check.names to FALSE.
Furthermore, I generate a vector of column classes, which I do like this:
path <- 'path_to_csv-file'
headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2, check.names = FALSE)
#print(headset)
headclasses <- vector(mode = 'character', length = 0)
#heavily simplified version - switch_statement is in an extra function
for (i in colnames(headset)) {
  headclasses[[i]] <- switch(i,
    'attribute_1' = 'numeric',
    'next attribute' = 'factor',
    'who creates, these horrible) column&nämes' = 'POSIXct'
  )
}
#print(colnames(headset))
#print(headclasses)
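For reference, the same mapping can be written without a loop. This is only a minimal sketch: class_map and cols are hypothetical names introduced here, and cols stands in for colnames(headset). It shows that the result is a named character vector, not a list.

```r
# Hypothetical lookup table: original column name -> class name.
class_map <- c(
  "attribute_1"    = "numeric",
  "next attribute" = "factor"
)

# Stand-in for colnames(headset) in the code above.
cols <- c("attribute_1", "next attribute")

# Indexing by name keeps the names, so this is a named character vector.
headclasses <- class_map[cols]

# Fail early if a column has no entry in the lookup table.
stopifnot(is.character(headclasses), !anyNA(headclasses))
```

Indexing with a name that is missing from class_map gives NA, so the stopifnot() check catches unmapped columns before any file is read.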
Now, if I call:
df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 1e4, VERBOSE = TRUE)
I get the following error:
Error in repnam(colClasses, colnames(x), default = NA) : the following argument names do not match'next attribute','(who creates, these horrible column&nämes)'
Why do I get this error? And how can I fix it so that I have the uglier strings as column names?
Note that check.names is set to FALSE in the previous call.
My work so far:
1. Trying with proper names but wrong check.names option when calling read.table.ffdf
If I let R choose proper column names (i.e. check.names = TRUE in the first call to a read method) and adjust the switch statement accordingly, I get no error at all (only a warning), even if I set check.names = FALSE in the read.table.ffdf call:
headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2)
print(headset)
headclasses <- vector(mode = 'character', length = 0)
#heavily simplified version - switch_statement is in an extra function
for (i in colnames(headset)) {
  headclasses[[i]] <- switch(i,
    'attribute_1' = 'numeric',
    'next.attribute' = 'factor',
    'who.creates..these.horrible..column.nämes' = 'POSIXct'
  )
}
print(colnames(headset))
print(headclasses)
my_df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2, VERBOSE = TRUE)
print(my_df)
print(colnames(my_df))
"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"
Warning message: In read.table(na.strings = c("\N", ""), sep = ";", dec = ".", colClasses = list( : not all columns named in 'colClasses' exist
So this works, when it shouldn't? Of course, leaving out check.names when calling read.table.ffdf works in the same way, so somewhere something goes missing.
2. Checking source Code of read.table.ffdf
I went to the rdrr.io site (read.table.ffdf-source-code) to check the source code and tried to understand what I am doing wrong. In short, this is what happens to my file:
rt.args <- list(na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2)
rt.args$file <- path
asffdf_args <- list()
FUN <- 'read.table'
dat <- do.call(FUN, rt.args)
x <- do.call("as.ffdf", c(list(dat), asffdf_args))
#print(colnames(dat))
#print(colnames(x))
and this yields
"attribute_1" "next attribute" "who creates, these horrible) column&nämes"
"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"
Ok, so this is where it goes wrong.
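The renamed columns look like what base R's make.names() produces, which is the function read.table applies when check.names = TRUE. Whether as.ffdf calls make.names() internally is an assumption on my part; the sketch below only shows that the pattern matches. The third header is a shortened, ASCII-only stand-in for my real one.

```r
# Headers similar to the ones in my file (third one shortened, ASCII only).
original <- c("attribute_1", "next attribute", "a, b) c&d")

# Every character that is not a letter, digit, '.' or '_' becomes '.'.
checked <- make.names(original, unique = TRUE)

print(checked)
# [1] "attribute_1"    "next.attribute" "a..b..c.d"
```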
I don't know which asffdf_args to pass and since I am kind of new to R, I am not sure what to look for exactly other than some kind of check.names equivalent. I already had a look at the as.ffdf.data.frame method via
getAnywhere(as.ffdf.data.frame)
but that didn't help me understand what I should put in. So, how can I make read.table.ffdf work with the uglier column names? Which 'asffdf_args' do I have to pass to make check.names = FALSE work in said method?
I could adapt my switch-statement (for roughly 400 columns), read the file with check.names = TRUE and after read.table.ffdf is done, I could set the column names to the desired ones (since I have to work with the nastier names later on). But this classifies as a workaround for me and does not satisfy me at all.
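The workaround could be sketched as follows without touching the switch statement, by translating the names of the class vector with make.names(). This is a minimal sketch on a plain data frame with a temporary two-column file; csv, original, syntactic and safe_classes are names introduced here for illustration. Whether an ffdf accepts the original names when they are assigned afterwards is an assumption I have not verified.

```r
# Small stand-in file with two of the awkward headers.
csv <- tempfile(fileext = ".csv")
writeLines(c("attribute_1;next attribute", "198705;RXBR "), csv)

# Original headers, read without name checking.
original <- colnames(read.csv(csv, sep = ";", nrows = 1, check.names = FALSE))

# Class vector keyed by the original names.
headclasses <- c("attribute_1" = "numeric", "next attribute" = "factor")

# Same classes, keyed by the names that check.names = TRUE would produce.
syntactic <- make.names(original, unique = TRUE)
safe_classes <- setNames(headclasses[original], syntactic)

# Read with the default check.names = TRUE, then restore the original names.
dat <- read.table(csv, sep = ";", header = TRUE, colClasses = safe_classes)
colnames(dat) <- original[match(colnames(dat), syntactic)]

unlink(csv)
```

In the real case, safe_classes would be passed as colClasses to read.table.ffdf and the renaming would happen on the resulting ffdf.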
This is my first question here, so be gentle with me, if I am overlooking something major and feel free to push me in the right direction.
Thanks in advance for the help.
colClasses should be a named character vector and not a list – RolandASc