I have been grappling with the following problem for a while: I need to load, manipulate, and score new datasets as quickly as possible. I have defined a data dictionary that records the class of each variable (e.g. numeric, factor, character, date) and, where applicable, the full list of possible factor levels:
DD <- data.frame(Var = c("a", "b", "c", "d"),
                 Class = c("Numeric", "Factor", "Factor", "Date"),
                 Levels = c(NA, "B1, B2, B3", "C1, C2", NA))
Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01")
Ultimately, I intend to use model.matrix to produce a design matrix with the same set of indicator columns regardless of which factor levels are actually observed in a particular dataset, so that I can score the data with a given model.
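To illustrate the goal, here is a minimal sketch with the levels typed in by hand rather than read from the data dictionary (the hard-coded level vectors and the name X are just for illustration). Once the full level sets are declared, model.matrix should emit a column for every non-reference level, including levels that never occur in this dataset:

```r
# One-row example dataset: only level "B1" of b and "C2" of c are observed.
Data <- data.frame(a = 5, b = "B1", c = "C2", stringsAsFactors = FALSE)

# Declare the full level sets up front (hard-coded here for illustration;
# in practice they would come from the data dictionary).
Data$b <- factor(Data$b, levels = c("B1", "B2", "B3"))
Data$c <- factor(Data$c, levels = c("C1", "C2"))

# With default treatment contrasts, the first level of each factor is the
# reference and every other declared level gets its own indicator column.
X <- model.matrix(~ a + b + c, data = Data)
colnames(X)
# "(Intercept)" "a" "bB2" "bB3" "cC2"
```

Without the explicit levels, each factor would carry only the single level seen in this row, and the design matrix could not be lined up with the one the model was fitted on.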
I need to do this as quickly as possible, so I am trying to find a solution that avoids lapply or explicit loops. Here is (a slightly convoluted version of) my existing solution for setting the factor levels, which is currently too slow for my requirements:
lapply(1:ncol(Data[, DD$Class == "Factor"]), function(i) {
  factor(as.character(unlist(Data[, DD$Class == "Factor"][i])),
         levels = unlist(strsplit(as.character(DD$Levels[DD$Class == "Factor"][i]), ", ")))
})
Any suggestions for avoiding a loop here, if that is even possible, or any alternative solutions would be much appreciated!
Thanks!