I'm looking for a trick / technique to optimize an operation like the following:
library(dplyr)
n <- 1e5
d <- tibble(x=sample(800, n, TRUE),
            y=sample(2000, n, TRUE) %>% as.Date(origin='1970-01-01'),
            z=sample(5, n, TRUE),
            val=runif(n))
system.time({
  y_dp <- d %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
# user system elapsed
# 1014.918 9.760 1027.845
This is pretty vanilla - group by two variables, then compute a scalar summary for each group based on two other variables.
data.table is able to handle this about 10000x more efficiently for this size of data:
library(data.table)
system.time({
  y_dt <- data.table(d, key=c("x", "y")) %>%
    `[`(, .(w=val[which.max(z)]), by=list(x, y)) %>%
    as_tibble()
})
# user system elapsed
# 0.109 0.003 0.112
all.equal(y_dt, y_dp)
# TRUE
It presumably can achieve that by indexing (sorting, in this case) based on the keys, then iterating linearly through the structure; dplyr presumably has to construct separate indices into the structure for each combination (x, y).
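To make the "sort, then iterate linearly" idea concrete, here is a minimal base-R sketch. It is only an illustration of the technique, not a description of what data.table actually does internally. It assumes a smaller toy version of d (fewer distinct keys, so groups have several rows), and the names ord, s, starts and res_sorted are made up for this example.

```r
set.seed(42)
n <- 1e4
# Toy data shaped like d above, but with fewer distinct (x, y) pairs
d <- data.frame(
  x   = sample(20, n, TRUE),
  y   = as.Date(sample(50, n, TRUE), origin = "1970-01-01"),
  z   = sample(5, n, TRUE),
  val = runif(n)
)

# 1. Sort once by the grouping keys, then by descending z.
#    order() leaves unresolved ties in their original order, so the first
#    row of each group is the same row that which.max(z) would pick.
ord <- order(d$x, d$y, -d$z)
s <- d[ord, ]

# 2. One linear pass: a row starts a new group when either key differs
#    from the key of the previous row.
m <- nrow(s)
starts <- c(TRUE, s$x[-1] != s$x[-m] | s$y[-1] != s$y[-m])

# 3. The summary is simply val on the first row of each group.
res_sorted <- data.frame(x = s$x[starts], y = s$y[starts], w = s$val[starts])
```

After the single sort, the per-group work is just a vectorised comparison of neighbouring rows, with no per-group index to build.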
Pre-sorting by (x, y) doesn't help the dplyr case either, as it doesn't seem to "remember" that the data is sorted by what it's grouping by:
system.time({
  y3 <- d %>%
    arrange(x, y) %>%
    group_by(x, y) %>%
    summarize(w = val[which.max(z)])
})
# user system elapsed
# 1048.983 13.616 1070.929
Indeed, since the class & attributes of a tibble don't change after sorting, it seems there's no way to leverage the sorting afterwards.
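As a rough way to see the difference, the sketch below compares what a tibble carries after arrange() with what a keyed data.table carries. It assumes dplyr and data.table are both installed, and the tiny tibble and the names sorted and dt are invented for the example.

```r
library(dplyr)
library(data.table)

d <- tibble(x = c(2, 1, 2, 1), y = c(1, 1, 0, 0), val = 1:4)
sorted <- arrange(d, x, y)

# The sorted tibble has the same class and the same set of attributes,
# so nothing on the object records that it is now ordered by (x, y).
identical(class(d), class(sorted))
setequal(names(attributes(d)), names(attributes(sorted)))

# A keyed data.table, by contrast, records its sort key on the object.
dt <- data.table(d, key = c("x", "y"))
haskey(dt)
key(dt)
```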
Thoughts?
EDIT: I mistakenly wrote n <- 5e4 when the timings were actually done with n <- 1e5; I have now fixed that in an edit. Also, here are my specs:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.2
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 compiler_3.6.0 magrittr_1.5 assertthat_0.2.1
[5] R6_2.4.0 pillar_1.4.2 glue_1.3.1 tibble_2.1.3
[9] crayon_1.3.4 Rcpp_1.0.1 pkgconfig_2.0.2 rlang_0.4.0
[13] purrr_0.3.2
I get 0.209 0.022 0.231. Change the summarize to summarise. Also, is it possible that you loaded the plyr library as well? I used set.seed(24) for creating a reproducible example. – akrun
I get 1.22 elapsed for the first example. – Mako212
I installed dplyr_0.8.3 in a private library and now I get 0.447 0.050 0.500 too. That's quite dramatic! Release notes for 0.8.3 say "Fixed performance regression introduced in version 0.8.2 (#4458)". – Ken Williams
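For anyone seeing similar timings, a quick check of the installed version might look like the sketch below. It assumes the slowdown is confined to dplyr 0.8.2, as the release note quoted above suggests. The variable name affected is just for illustration.

```r
# NA means dplyr is not installed, so there is nothing to check
affected <- NA

if (requireNamespace("dplyr", quietly = TRUE)) {
  # package_version objects can be compared directly with a version string
  affected <- packageVersion("dplyr") == "0.8.2"
  if (affected) {
    message("dplyr 0.8.2 detected; upgrading to 0.8.3 or later may help.")
  } else {
    message("Installed dplyr version: ", as.character(packageVersion("dplyr")))
  }
}
```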