Row Wise Mode in data.table R

Question

I am trying to find an efficient way to get a row wise modes on a subset of columns in data.table

#Sample data    
a <- data.frame( 
        id=letters[], 
        dattyp1 = sample( 1:2, 26, replace=T) , 
        dattyp2 = sample( 1:2, 26, replace=T) , 
        dattyp3 = sample( 1:2, 26, replace=T) ,
        dattyp4 = sample( 1:2, 26, replace=T) , 
        dattyp5 = sample( 1:2, 26, replace=T) , 
        dattyp6 = sample( 1:2, 26, replace=T)
        )

    library(modeest)
    library(data.table)

I know from: To find "row wise" "Mode" of a given data in R that I can do this:

Mode <- function(x) {
     ux <- unique(x)
          ux[which.max(tabulate(match(x, ux)))]
    }   

apply(a[ ,paste0("dattyp",1:6)], 1, Mode)

But this is really slow (over my millions of records). I am thinking there must be a way to do it with .SDcols - but this does column wise modes not row wise.

a<- data.table( a )
    a[ , lapply(.SD , mfv ), .SDcols=c(paste0("dattyp",1:6) ) ]

Is the example -- 6 binary integer columns with ~1e7 rows-- provided representative of your actual dataset? The optimal solution will probably vary depending on the column types, (i.e. integers can be handled in different ways than character strings), the number of columns, cardinality, and row count. (I took a quick stab at it and couldn't come up with anything faster than your original solution, and fwiw, modeest::mfv() seems to be slower than the user defined function Mode()) — Matt Summersgill
I agree, mfv is slower than the mode function the OP defined. — mgriebe

chinsoon12 chinsoon12 · Accepted Answer · 2018-05-17T00:27:47

I think the fastest way via data.table is still to convert into a relational (i.e. long) format and aggregate and then find max in reldtMtd function as follows. I wonder if using Rcpp will be faster.

data:

library(data.table)
M <- 1e6
popn <- 2
set.seed(0L)
a <- data.frame( 
    id=1:M, 
    dattyp1 = sample(popn, M, replace=TRUE), 
    dattyp2 = sample(popn, M, replace=TRUE), 
    dattyp3 = sample(popn, M, replace=TRUE),
    dattyp4 = sample(popn, M, replace=TRUE), 
    dattyp5 = sample(popn, M, replace=TRUE), 
    dattyp6 = sample(popn, M, replace=TRUE)
)    
setDT(a)

methods:

reldtMtd <- function() {
    melt(a, id.vars="id")[, 
        .N, by=.(id, value)][,
            value[which.max(N)], by=.(id)] 
}

#from https://stackoverflow.com/a/8189441/1989480
Mode <- compiler::cmpfun(function(x) {   
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
})
Mode2 <- compiler::cmpfun(function(x) names(which.max(table(x))))
matA <- as.matrix(a[, -1L])

baseMtd1 <- function() apply(matA, 1, Mode)
baseMtd2 <- function() apply(matA, 1, Mode2)

library(microbenchmark)
microbenchmark(reldtMtd(), baseMtd1(), baseMtd2(), times=3L)

timings:

Unit: seconds
       expr        min         lq       mean     median         uq       max neval
 reldtMtd()   1.882783   1.947515   2.031767   2.012248   2.106259   2.20027     3
 baseMtd1()  15.618716  15.675314  15.809277  15.731913  15.904557  16.07720     3
 baseMtd2() 160.837513 161.692634 162.455048 162.547755 163.263816 163.97988     3

Row Wise Mode in data.table R

2 Answers