1 vote

I'm looking for a trick / technique to optimize an operation like the following:

library(dplyr)

n <- 1e5

d <- tibble(x=sample(800, n, TRUE),
            y=sample(2000, n, TRUE) %>% as.Date(origin='1970-01-01'),
            z=sample(5, n, TRUE),
            val=runif(n))

system.time({
  y_dp <- d %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
#     user   system  elapsed 
# 1014.918    9.760 1027.845 

This is pretty vanilla - group by two variables and compute a scalar summary for each group based on two other columns.

data.table is able to handle this about 10000x more efficiently for this size of data:

library(data.table)
system.time({
  y_dt <- data.table(d, key=c("x", "y")) %>%
    `[`(, .(w=val[which.max(z)]), by=list(x, y)) %>%
    as_tibble()
})
#    user  system elapsed 
#   0.109   0.003   0.112 

all.equal(y_dt, y_dp)
# TRUE

It can presumably achieve that by indexing (sorting, in this case) on the keys and then iterating linearly through the structure; dplyr presumably has to construct a separate index into the structure for each (x, y) combination.
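A rough base-R sketch of that sorted linear-scan idea (purely illustrative, not data.table's actual implementation; the names o, starts, ends and y_scan are made up for this example): sort once by (x, y), find the boundaries of each contiguous group, then compute each summary from a simple index range in a single pass.

o    <- order(d$x, d$y)
xs   <- d$x[o]; ys <- d$y[o]; zs <- d$z[o]; vals <- d$val[o]

starts <- which(!duplicated(data.frame(xs, ys)))  # first row of each contiguous (x, y) block
ends   <- c(starts[-1] - 1L, nrow(d))             # last row of each block

w <- mapply(function(s, e) vals[s:e][which.max(zs[s:e])], starts, ends)

y_scan <- tibble(x = xs[starts], y = ys[starts], w = w)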

Pre-sorting by (x, y) doesn't help the dplyr case either, as it doesn't seem to "remember" that the data is sorted by what it's grouping by:

system.time({
  y3 <- d %>%
    arrange(x, y) %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
#     user   system  elapsed 
# 1048.983   13.616 1070.929 

Indeed, since the class & attributes of a tibble don't change after sorting, it seems there's no way to leverage the sorting afterwards.
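A quick check illustrating that point (a_before / a_after are just throwaway names for this demonstration):

a_before <- attributes(d)
a_after  <- attributes(d %>% arrange(x, y))
identical(sort(names(a_before)), sort(names(a_after)))
# TRUE - both have only "names", "row.names" and "class"; nothing records the sort order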

Thoughts?

EDIT: I mistakenly wrote n <- 5e4 earlier; the timings were actually done with n <- 1e5, and I've fixed that above. Also, here are my specs:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.2

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 compiler_3.6.0   magrittr_1.5     assertthat_0.2.1
 [5] R6_2.4.0         pillar_1.4.2     glue_1.3.1       tibble_2.1.3    
 [9] crayon_1.3.4     Rcpp_1.0.1       pkgconfig_2.0.2  rlang_0.4.0     
[13] purrr_0.3.2     
Comments:

akrun: For me, it is taking just 0.209 0.022 0.231. Change the summarize to summarise. Also, is it possible that you loaded the plyr library as well? I used set.seed(24) for creating a reproducible example.

Mako212: Yeah, I get 1.22 elapsed for the first example.

Ken Williams: Aha - I installed dplyr_0.8.3 in a private library and now I get 0.447 0.050 0.500 too. That's quite dramatic! Release notes for 0.8.3 say "Fixed performance regression introduced in version 0.8.2 (#4458)".

Ken Williams: github.com/tidyverse/dplyr/issues/4458 - I can write this up as an answer unless someone else wants to.

Frank: FYI, further optimization to grouped var[which.max(x)] in data.table is planned: github.com/Rdatatable/data.table/issues/523. I guess it would work for ordered factors, not sure.

2 Answers

3 votes

This was caused by a performance regression in dplyr version 0.8.2:

https://github.com/tidyverse/dplyr/issues/4458

The performance penalty is nonlinear: in issue #4458 above it was about 500x, in my example it was about 10000x, and on my real data set I would probably have to wait for the heat death of the universe to measure it.

Upgrading to dplyr 0.8.3 fixed the problem for me:

> install.packages('dplyr')
# Installing package into ‘/private/tmp/lib’
# ...
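
# Optional sanity check: after restarting R and re-attaching dplyr, confirm
# that the upgraded version is the one actually in use.
library(dplyr)
packageVersion("dplyr")
# expect 0.8.3 or later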

n <- 1e5

d <- tibble(x=sample(800, n, TRUE),
            y=sample(2000, n, TRUE) %>% as.Date(origin='1970-01-01'),
            z=sample(5, n, TRUE),
            val=runif(n))
system.time({
  y_dp <- d %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
#   user  system elapsed 
#  0.447   0.050   0.500 

1 vote

The following is a perhaps more readable version of your data.table code.

You can pipe the data.table with the . pronoun from magrittr. Another detail is that you can use .(x, y) instead of list(x, y). Sorting isn't relevant for data.table in aggregations, but it is for joins.

library(data.table)

system.time({
  y_dt <- data.table(d) %>%
    .[, .(w = val[which.max(z)]), .(x, y)] %>%
    as_tibble()
})

Another variant removes the first pipe, and therefore the magrittr . pronoun, as mentioned in the comments.

system.time({
  y_dt <- as.data.table(d)[, .(w = val[which.max(z)]), .(x,y)] %>%
    as_tibble()
})

Note that I use as.data.table(d) instead of setDT(d), since the latter would modify d by reference.
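
If you want to see that by-reference behaviour for yourself, here is a small sketch (d2 is just a throwaway copy for the demonstration):

d2 <- copy(d)   # deep copy, so the original d is left untouched
setDT(d2)       # converts d2 to a data.table in place; nothing needs to be reassigned
class(d2)
# [1] "data.table" "data.frame"

Calling setDT(d) directly would have converted d itself in the same way.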

Benchmark:

       user  system elapsed 
dplyr  2.643   0.000   2.642
DT     0.158   0.000   0.092

It seems data.table is still about 28 times faster than dplyr in this example.