Fastest way of determining most frequent factor in a grouped data frame in dplyr

Question

I am trying to find the most frequent value within a group for several factor variables while summarizing a data frame in dplyr. I need a formula that does the following:

Find the most frequently used factor level among all factors for one variable in a group (so basically "max()" for counts of factor levels).
If there is a tie between several most-used-factor levels, pick any one of those factors-levels.
Return the factor-level name (not number of counts).

There are several formulas that work. However, those that I could think of are all slow. Those that are fast are not convenient to apply to several variables in a data frame at once. I was wondering if somebody knows a fast method that integrates nicely with dplyr.

I tried the following:

generating sample data (50000 groups with 100 random letters)

z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))

str(z)
'data.frame':   5000000 obs. of  2 variables:
$ a: int  1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ...

"Clean"-but-slow approach 1

 y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(table(b))[which.max(table(b))])

user    system  elapsed 
26.772  2.011   29.568

"Clean"-but-slow approach 2

y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(which(table(b) == max(table(b)))[1]))

user    system  elapsed 
29.329  2.029   32.361

"Clean"-but-slow approach 3

y <- z %>% 
    group_by(a) %>% 
    summarise(c = names(sort(table(b),decreasing = TRUE)[1]))

user    system  elapsed 
35.086  6.905   42.485

"Messy"-but-fast approach

y <- z %>% 
     group_by(a,b) %>% 
     summarise(counter = n()) %>% 
     group_by(a) %>% 
     filter(counter == max(counter))
y <- y[!duplicated(y$a),]
y <- y$counter <- NULL

user   system  elapsed 
7.061  0.330   7.664

Steven Beaupré Steven Beaupré · Accepted Answer · 2015-09-24T17:38:50

Here's another option with dplyr:

set.seed(123)
z <- data.frame(a = rep(1:50000,100), 
                b = sample(LETTERS, 5000000, replace = TRUE), 
                stringsAsFactors = FALSE)

a <- z %>% group_by(a, b) %>% summarise(c=n()) %>% filter(row_number(desc(c))==1) %>% .$b 
b <- z %>% group_by(a) %>% summarise(c=names(which(table(b) == max(table(b)))[1])) %>% .$c

We make sure these are equivalent approaches:

> identical(a, b)
#[1] TRUE

Update

As per mentioned by @docendodiscimus, you could also do:

count(z, a, b) %>% slice(which.max(n))

Here are the results on the benchmark:

library(microbenchmark)
mbm <- microbenchmark(
  steven = z %>% group_by(a, b) %>% summarise(c = n()) %>% filter(row_number(desc(c))==1),
  phil = z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])),
  docendo = count(z, a, b) %>% slice(which.max(n)),
  times = 10
)

#Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#  steven  4.752168  4.789564  4.815986  4.813686  4.847964  4.875109    10  b 
#    phil 15.356051 15.378914 15.467534 15.458844 15.533385 15.606690    10   c
# docendo  4.586096  4.611401  4.669375  4.688420  4.702352  4.753583    10 a

Fastest way of determining most frequent factor in a grouped data frame in dplyr

4 Answers