
I have been grappling with the following problem for a while, as I need to load in, manipulate, and produce scores from new datasets as quickly as possible. I have defined a data dictionary containing a description of each variable class (e.g. numeric, factor, character, date) and, where applicable, a list of all possible factor levels:

DD <- data.frame(Var = c("a", "b", "c", "d"),
Class = c("Numeric", "Factor", "Factor", "Date"),
Levels = c(NA, "B1, B2, B3", "C1, C2", NA))

Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01")

Ultimately, I intend to use model.matrix to produce a design matrix with a common set of indicator variables/columns, regardless of which factor levels are actually observed in a particular dataset, so that I can score the data with a previously fitted model.
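To illustrate the goal, here is a minimal sketch (the `toy` data frame is a made-up single row, with the levels copied from DD above). It assumes R's default treatment contrasts. Because the levels are fixed up front, model.matrix produces the same columns even though only one level of each factor is actually observed:

```r
# Toy single-row dataset; levels are declared explicitly rather than
# inferred from the observed values
toy <- data.frame(
  a = 5,
  b = factor("B1", levels = c("B1", "B2", "B3")),
  c = factor("C2", levels = c("C1", "C2"))
)

# With default treatment contrasts the first level of each factor is the
# baseline, so it gets no indicator column of its own
mm <- model.matrix(~ a + b + c, data = toy)
colnames(mm)
# "(Intercept)" "a" "bB2" "bB3" "cC2"
```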

I need to do these tasks as quickly as possible and, hence, I am trying to find a solution that avoids using lapply/loops. Here is (a slightly convoluted version of) my existing solution for setting the factor levels, which is currently too slow for my requirements:

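# For each column flagged as "Factor" in DD, rebuild it with the predefined levels.
# drop = FALSE keeps a data frame even when only one factor column exists.
lapply(seq_len(sum(DD$Class == "Factor")), function(i) {
    factor(as.character(unlist(Data[, DD$Class == "Factor", drop = FALSE][i])),
           levels = unlist(strsplit(as.character(DD$Levels[DD$Class == "Factor"][i]), ", ")))
})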

Any suggestions for avoiding use of a loop here, if it is even possible, or any alternative solutions would be much appreciated!

Thanks!

In short, you're trying to assign predefined levels to dataframe cols according to column name? - mtoto
Thanks for the responses guys, and apologies for posting in the wrong place! Yes, the main issue I'm facing when attempting to vectorise the data manipulation tasks is simply assigning predefined levels to my factors. - Nick Masca

1 Answer


Sorry, I don't have enough reputation to add this as a comment.

Can I ask: 1. What are the dimensions of your dataset? 2. What running time would be acceptable to you?

You could consider using Microsoft R Open (previously Revolution R Open), which optimises some basic data manipulation.
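On the level-assignment itself, here is a rough sketch of one possible tidy-up in base R, not a benchmarked solution. It reuses DD and Data from the question (with stringsAsFactors = FALSE added so the columns are plain character); `is_fac`, `level_list` and `fac_vars` are names made up for illustration. The idea is to split all the level strings once, then replace every factor column in a single assignment. Map is still a loop under the hood, but it runs once per column rather than per row, and each factor() call is vectorised over the rows:

```r
# Data dictionary and data from the question, kept as character columns
DD <- data.frame(Var = c("a", "b", "c", "d"),
                 Class = c("Numeric", "Factor", "Factor", "Date"),
                 Levels = c(NA, "B1, B2, B3", "C1, C2", NA),
                 stringsAsFactors = FALSE)
Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01",
                   stringsAsFactors = FALSE)

# Split every level string once and key the result by variable name
is_fac <- DD$Class == "Factor"
level_list <- setNames(strsplit(DD$Levels[is_fac], ", ", fixed = TRUE),
                       DD$Var[is_fac])

# Pair each factor column with its predefined levels and overwrite
# all of those columns in one assignment
fac_vars <- names(level_list)
Data[fac_vars] <- Map(function(x, lv) factor(x, levels = lv),
                      Data[fac_vars], level_list)

levels(Data$b)
levels(Data$c)
```

Matching by column name rather than by position also means the rows of DD no longer have to be in the same order as the columns of Data. Whether this is fast enough will depend on the answers to the two questions above.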