I am trying to find the most frequent value within a group for several factor variables while summarizing a data frame in dplyr. I need a formula that does the following:
- Find the most frequently used factor level among all factors for one variable in a group (so basically "max()" for counts of factor levels).
- If there is a tie between several most-used-factor levels, pick any one of those factors-levels.
- Return the factor-level name (not number of counts).
There are several formulas that work. However, those that I could think of are all slow. Those that are fast are not convenient to apply to several variables in a data frame at once. I was wondering if somebody knows a fast method that integrates nicely with dplyr.
I tried the following:
generating sample data (50000 groups with 100 random letters)
z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))
str(z)
'data.frame': 5000000 obs. of 2 variables:
$ a: int 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ...
"Clean"-but-slow approach 1
y <- z %>%
group_by(a) %>%
summarise(c = names(table(b))[which.max(table(b))])
user system elapsed
26.772 2.011 29.568
"Clean"-but-slow approach 2
y <- z %>%
group_by(a) %>%
summarise(c = names(which(table(b) == max(table(b)))[1]))
user system elapsed
29.329 2.029 32.361
"Clean"-but-slow approach 3
y <- z %>%
group_by(a) %>%
summarise(c = names(sort(table(b),decreasing = TRUE)[1]))
user system elapsed
35.086 6.905 42.485
"Messy"-but-fast approach
y <- z %>%
group_by(a,b) %>%
summarise(counter = n()) %>%
group_by(a) %>%
filter(counter == max(counter))
y <- y[!duplicated(y$a),]
y <- y$counter <- NULL
user system elapsed
7.061 0.330 7.664