39
votes

I have several data sets with 75,000 observations and a type variable that can take on a value 0-4. I want to add five new dummy variables to each data set, one for each type. The best way I could come up with to do this is as follows:

# For the 'binom' data set, create dummy variables for all types
binom.dummy.list<-list()
for(i in 0:4){
    binom.dummy.list[[i+1]]<-sapply(binom$type,function(t) ifelse(t==i,1,0))
}

# Add and merge data
binom.dummy.df<-as.data.frame(do.call("cbind",binom.dummy.list))
binom.dummy.df<-transform(binom.dummy.df,id=1:nrow(binom))
binom<-merge(binom,binom.dummy.df,by="id")

While this works, it is incredibly slow (the merge function has even crashed a few times). Is there a more efficient way to do this? Perhaps this functionality is part of a package that I am not familiar with?

8 Answers

49
votes

R has a "sub-language" for translating formulas into a design matrix, and in the spirit of the language you can take advantage of it. It's fast and concise. Example: you have a continuous predictor x, a categorical predictor catVar, and a response y.

> binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
> head(binom)
          y          x catVar
1 0.5051653 0.34888390      2
2 0.4868774 0.85005067      2
3 0.3324482 0.58467798      2
4 0.2966733 0.05510749      3
5 0.5695851 0.96237936      1
6 0.8358417 0.06367418      2

You just do

> A <- model.matrix(y ~ x + catVar,binom) 
> head(A)
  (Intercept)          x catVar1 catVar2 catVar3 catVar4
1           1 0.34888390       0       1       0       0
2           1 0.85005067       0       1       0       0
3           1 0.58467798       0       1       0       0
4           1 0.05510749       0       0       1       0
5           1 0.96237936       1       0       0       0
6           1 0.06367418       0       1       0       0

Done.
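
Note that the output above has only four catVar columns: with an intercept in the formula, the first level (0) is absorbed as the reference level. If you need all five indicators, as the question asks, one option is to drop the intercept. A minimal sketch, rebuilding the same kind of binom data frame (the name A_full and the seed are only for illustration):

```r
set.seed(42)  # illustrative seed, not part of the original answer
binom <- data.frame(y = runif(1e5), x = runif(1e5),
                    catVar = as.factor(sample(0:4, 1e5, TRUE)))

# "- 1" removes the intercept, so the first factor in the formula
# (catVar) is expanded into one indicator column per level
A_full <- model.matrix(y ~ x + catVar - 1, binom)
colnames(A_full)
```

For a regression that includes an intercept, the four-column version is usually the one to use, since five indicators plus an intercept are collinear.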

24
votes

Drew, this is much faster and shouldn't cause any crashes.

> binom <- data.frame(data=runif(1e5),type=sample(0:4,1e5,TRUE))
> for(t in unique(binom$type)) {
+   binom[paste("type",t,sep="")] <- ifelse(binom$type==t,1,0)
+ }
> head(binom)
        data type type2 type4 type1 type3 type0
1 0.11787309    2     1     0     0     0     0
2 0.11884046    4     0     1     0     0     0
3 0.92234950    4     0     1     0     0     0
4 0.44759259    1     0     0     1     0     0
5 0.01669651    2     1     0     0     0     0
6 0.33966184    3     0     0     0     1     0
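
The new columns above appear in whatever order unique() happens to return the values. If a predictable column order matters, a small variation on the same loop (a sketch, not the only way to do it) sorts the values first and uses as.integer() on the comparison instead of ifelse():

```r
set.seed(1)  # illustrative seed
binom <- data.frame(data = runif(1e5), type = sample(0:4, 1e5, TRUE))

# Loop over the type values in sorted order so the columns come out as
# type0, type1, ..., type4
for (t in sort(unique(binom$type))) {
  # The comparison is already vectorized; as.integer() maps TRUE/FALSE to 1/0
  binom[[paste0("type", t)]] <- as.integer(binom$type == t)
}
head(binom)
```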

16
votes

What about using model.matrix()?

> binom <- data.frame(data=runif(1e5),type=sample(0:4,1e5,TRUE))
> head(binom)
       data type
1 0.1412164    2
2 0.8764588    2
3 0.5559061    4
4 0.3890109    3
5 0.8725753    3
6 0.8358100    1
> inds <- model.matrix(~ factor(binom$type) - 1)
> head(inds)
  factor(binom$type)0 factor(binom$type)1 factor(binom$type)2 factor(binom$type)3 factor(binom$type)4
1                   0                   0                   1                   0                   0
2                   0                   0                   1                   0                   0
3                   0                   0                   0                   0                   1
4                   0                   0                   0                   1                   0
5                   0                   0                   0                   1                   0
6                   0                   1                   0                   0                   0
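
The indicator matrix still has to be attached to the data, and the default column names are unwieldy. One possible way to finish the job, assuming the same binom layout as above (the "type0" ... "type4" names are just a naming choice):

```r
set.seed(1)  # illustrative seed
binom <- data.frame(data = runif(1e5), type = sample(0:4, 1e5, TRUE))

# Build the full set of indicators (no intercept, so no level is dropped)
inds <- model.matrix(~ factor(type) - 1, data = binom)

# Replace the long default names with type0 ... type4
colnames(inds) <- paste0("type", levels(factor(binom$type)))

# cbind() attaches the columns by position, so no merge() on an id is needed
binom <- cbind(binom, inds)
head(binom)
```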

3
votes

If you're open to using the data.table package, mltools has a one_hot() function.

library(data.table)
library(mltools)

binom <- data.table(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
one_hot(binom)

                 y          x catVar_0 catVar_1 catVar_2 catVar_3 catVar_4
     1: 0.90511891 0.83045050        0        0        1        0        0
     2: 0.91375984 0.73273830        0        0        0        1        0
     3: 0.01926608 0.10301409        0        0        1        0        0
     4: 0.48691138 0.24428157        0        1        0        0        0
     5: 0.60660396 0.09132816        0        0        1        0        0
    ---                                                                   
 99996: 0.12908356 0.26157731        0        1        0        0        0
 99997: 0.96397273 0.98959000        0        1        0        0        0
 99998: 0.16818414 0.37460941        1        0        0        0        0
 99999: 0.72610508 0.72055867        1        0        0        0        0
100000: 0.89710998 0.24155507        0        0        0        0        1

Usage

one_hot(dt, cols = "auto", sparsifyNAs = FALSE, 
        naCols = FALSE, dropCols = TRUE,
        dropUnusedLevels = FALSE)

The cols argument selects which column(s) to one-hot encode. The default, cols = "auto", encodes all unordered factor columns, so the command below is equivalent here. Naming the columns explicitly only matters when the data.table contains factors that should not be encoded.

one_hot(binom, cols="catVar")

2
votes

The recipes package can also do this. The example below is verbose for a single step, but the approach stays clean as you add more preprocessing steps.

library(recipes)

binom <- data.frame(y = runif(1e5), 
                    x = runif(1e5),
                    catVar = as.factor(sample(0:4, 1e5, TRUE))) # use the example from gappy
head(binom)

new_data <- recipe(y ~ ., data = binom) %>% 
  step_dummy(catVar) %>% # add dummy variables
  prep(training = binom) %>% # estimate the preprocessing steps (could be more than just adding dummy variables)
  bake(new_data = binom) # apply the recipe to new data
head(new_data)

Other step examples are step_scale, step_center, step_pca, etc.

0
votes

ifelse is vectorized, so if I understand your code correctly, you don't need that sapply. And I wouldn't use merge - I would use SQLite or PostgreSQL.

Some sample data would help too :-)
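
To illustrate the first point, here is a rough sketch with made-up sample data (the real binom is assumed to have a numeric type column taking values 0-4). Each comparison runs over the whole column at once, and cbind() replaces the merge():

```r
set.seed(1)  # illustrative seed; the data below is invented for the example
binom <- data.frame(data = runif(75000), type = sample(0:4, 75000, TRUE))

# One vectorized comparison per type value; the outer sapply() only
# loops five times, not once per row
dummies <- sapply(0:4, function(i) as.integer(binom$type == i))
colnames(dummies) <- paste0("type", 0:4)

# Rows are already aligned, so bind by position instead of merging on an id
binom <- cbind(binom, dummies)
head(binom)
```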

0
votes

The nnet package for single-layer neural networks (which don't understand factors) has a conversion command: class.ind.
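
A short sketch of how that might look, using invented sample data in the shape of the question's binom (the "type" prefix on the column names is just a naming choice):

```r
library(nnet)

set.seed(1)  # illustrative seed; the data below is invented for the example
binom <- data.frame(data = runif(1e5), type = sample(0:4, 1e5, TRUE))

# class.ind() returns a 0/1 matrix with one column per class,
# named after the factor levels ("0" ... "4" here)
inds <- class.ind(binom$type)
colnames(inds) <- paste0("type", colnames(inds))

binom <- cbind(binom, inds)
head(binom)
```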

0
votes

You can use the dummies package:

binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
head(binom)

          y          x catVar
1 0.4143348 0.09721401      1
2 0.3140782 0.54340539      3
3 0.1262037 0.51820499      2
4 0.7159850 0.13167720      3
5 0.8203528 0.94116026      3
6 0.2169781 0.82020216      1

Solution:

library(dummies)
binom<-dummy.data.frame(binom)
head(binom)

          y          x catVar0 catVar1 catVar2 catVar3 catVar4
1 0.4143348 0.09721401       0       1       0       0       0
2 0.3140782 0.54340539       0       0       0       1       0
3 0.1262037 0.51820499       0       0       1       0       0
4 0.7159850 0.13167720       0       0       0       1       0
5 0.8203528 0.94116026       0       0       0       1       0
6 0.2169781 0.82020216       0       1       0       0       0