2
votes

I am trying to find an efficient way to get a row wise modes on a subset of columns in data.table

#Sample data    
a <- data.frame( 
        id=letters[], 
        dattyp1 = sample( 1:2, 26, replace=T) , 
        dattyp2 = sample( 1:2, 26, replace=T) , 
        dattyp3 = sample( 1:2, 26, replace=T) ,
        dattyp4 = sample( 1:2, 26, replace=T) , 
        dattyp5 = sample( 1:2, 26, replace=T) , 
        dattyp6 = sample( 1:2, 26, replace=T)
        )

    library(modeest)
    library(data.table)

I know from: To find "row wise" "Mode" of a given data in R that I can do this:

Mode <- function(x) {
     ux <- unique(x)
          ux[which.max(tabulate(match(x, ux)))]
    }   

apply(a[ ,paste0("dattyp",1:6)], 1, Mode)

But this is really slow (over my millions of records). I am thinking there must be a way to do it with .SDcols - but this does column wise modes not row wise.

a<- data.table( a )
    a[ , lapply(.SD , mfv ), .SDcols=c(paste0("dattyp",1:6) ) ]
2
Is the example -- 6 binary integer columns with ~1e7 rows-- provided representative of your actual dataset? The optimal solution will probably vary depending on the column types, (i.e. integers can be handled in different ways than character strings), the number of columns, cardinality, and row count. (I took a quick stab at it and couldn't come up with anything faster than your original solution, and fwiw, modeest::mfv() seems to be slower than the user defined function Mode()) - Matt Summersgill
I agree, mfv is slower than the mode function the OP defined. - mgriebe

2 Answers

2
votes

I think the fastest way via is still to convert into a relational (i.e. long) format and aggregate and then find max in reldtMtd function as follows. I wonder if using Rcpp will be faster.

data:

library(data.table)
M <- 1e6
popn <- 2
set.seed(0L)
a <- data.frame( 
    id=1:M, 
    dattyp1 = sample(popn, M, replace=TRUE), 
    dattyp2 = sample(popn, M, replace=TRUE), 
    dattyp3 = sample(popn, M, replace=TRUE),
    dattyp4 = sample(popn, M, replace=TRUE), 
    dattyp5 = sample(popn, M, replace=TRUE), 
    dattyp6 = sample(popn, M, replace=TRUE)
)    
setDT(a)

methods:

reldtMtd <- function() {
    melt(a, id.vars="id")[, 
        .N, by=.(id, value)][,
            value[which.max(N)], by=.(id)] 
}

#from https://stackoverflow.com/a/8189441/1989480
Mode <- compiler::cmpfun(function(x) {   
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
})
Mode2 <- compiler::cmpfun(function(x) names(which.max(table(x))))
matA <- as.matrix(a[, -1L])

baseMtd1 <- function() apply(matA, 1, Mode)
baseMtd2 <- function() apply(matA, 1, Mode2)

library(microbenchmark)
microbenchmark(reldtMtd(), baseMtd1(), baseMtd2(), times=3L)

timings:

Unit: seconds
       expr        min         lq       mean     median         uq       max neval
 reldtMtd()   1.882783   1.947515   2.031767   2.012248   2.106259   2.20027     3
 baseMtd1()  15.618716  15.675314  15.809277  15.731913  15.904557  16.07720     3
 baseMtd2() 160.837513 161.692634 162.455048 162.547755 163.263816 163.97988     3
1
votes

You can try this -though I am not sure how much faster it will be. Note, I am grabbing the first number returned by mfv.

library(modeest)
library(data.table)

a <- data.frame( 
  id=letters[], 
  dattyp1 = sample( 1:2, 26, replace=T) , 
  dattyp2 = sample( 1:2, 26, replace=T) , 
  dattyp3 = sample( 1:2, 26, replace=T) ,
  dattyp4 = sample( 1:2, 26, replace=T) , 
  dattyp5 = sample( 1:2, 26, replace=T) , 
  dattyp6 = sample( 1:2, 26, replace=T)
)


a<- data.table( a )

a[ , Mode:=mfv(c(dattyp1,dattyp2,dattyp3,dattyp4,dattyp5,dattyp6))[1],by=id ]

datatable could be faster. Apply:

microbenchmark(apply={
+   apply(a[ ,paste0("dattyp",1:6)], 1, Mode)
+ })
Unit: microseconds
  expr     min      lq     mean  median      uq      max neval
 apply 574.025 591.803 1056.807 624.988 704.396 39236.79   100

datatable by:

microbenchmark({
+   a[ , Mode:=mfv(c(dattyp1,dattyp2,dattyp3,dattyp4,dattyp5,dattyp6))[1],by=id ]
+ })
Unit: milliseconds
                                                                                                       expr     min       lq
 {     a[, `:=`(Mode, mfv(c(dattyp1, dattyp2, dattyp3, dattyp4,          dattyp5, dattyp6))[1]), by = id] } 2.44109 2.748053
     mean   median       uq      max neval
 3.049809 2.898769 3.139559 6.398032   100