
I want to compute deviations from group means in R with data.table. To do this efficiently, I would like to use data.table's optimised mean function, but I haven't found a way to use it within a composite call (i.e. x - mean(x)).

What I mean is that I can use x[, lapply(.SD, function(x) x - mean(x)), by=id], but I suspect that this approach does not use the optimised version of mean in data.table. Indeed, comparing the speed of:

  1. x[, lapply(.SD, mean), by=id]
  2. x[, lapply(.SD, function(x) mean(x)), by=id]

It turns out that in some cases 1) is 10 times faster than 2)! So how could I use a call like 1), but for a composite function like x - mean(x)? I did not succeed with an anonymous block {...} inside lapply.

Thanks!

Simulation showing how much faster mean is than function(x) mean(x):

library(data.table)

T = 50 
N = 20000
set.seed(123)
data_sim <- data.table(A = rnorm(N * T),
                       B1 = sample(c(0,1), N * T, replace = TRUE),
                       B2 = rnorm(N * T),
                       individual = rep(1:N, each = T))

library(microbenchmark)

mean2 <- function(x) mean(x)

microbenchmark(sol1 = data_sim[, lapply(.SD, mean), by=individual],
               sol2 = data_sim[, lapply(.SD, mean2), by=individual],
               sol3 = data_sim[, lapply(.SD, function(x) mean(x)), by=individual],
               dev_mean = data_sim[, lapply(.SD, function(x) x - mean(x)), by=individual],
               times = 5)  # 5 runs per expression, matching neval in the results

Results:

|expr     |       min|      mean|       max| neval|
|:--------|---------:|---------:|---------:|-----:|
|sol1     |  17.67686|  18.68033|  21.04078|     5|
|sol2     | 369.69595| 378.91943| 400.77024|     5|
|sol3     | 149.57088| 154.76857| 159.93155|     5|
|dev_mean | 218.44641| 286.00977| 404.06092|     5|
see ?GForce and also switch on verbose=TRUE - chinsoon12
Chinsoon's comment explains why you see the speed difference, but I can't figure out how to apply it to this problem. I think if github.com/Rdatatable/data.table/issues/1414 is done, it could be like DT[, mu := mean(x), by=g][, v := x - mu] (except with lapply and Map to iterate over columns), but the mean is not yet optimized with :=. - Frank
Thanks @chinsoon12 for the verbose=TRUE argument, which explains the speed difference! But I'm not sure how to apply it in my context. And according to @Frank, there seems to be little hope of using the optimized mean for my problem? The solution, then, seems to be to compute a table of group means (which uses GForce) and join it back to the original table? That would explain why the solution in this similar post worked so well. - Matifou
@Frank, do you want to write your comment as an answer? I think you got the right answer, that it's just not possible. - Matifou

1 Answer


Currently, by-group mean optimization (see ?GForce) is not available with :=, though it has been proposed.
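To see for yourself whether the optimization kicks in, you can pass verbose=TRUE to the call, as suggested in the comments. Here is a minimal sketch on a toy table (DT, g and x are made-up names for illustration, not from the question). The exact wording of the verbose messages depends on your data.table version, so I'm not reproducing them here:

```r
library(data.table)

# Toy data: 3 groups of 4 rows each (illustrative names only)
DT <- data.table(g = rep(1:3, each = 4), x = as.numeric(1:12))

# Bare `mean` inside lapply(.SD, ...): verbose output reports how j was optimized
res_bare <- DT[, lapply(.SD, mean), by = g, verbose = TRUE]

# Same computation wrapped in an anonymous function: compare the verbose output
res_wrapped <- DT[, lapply(.SD, function(x) mean(x)), by = g, verbose = TRUE]

# The two results should be identical; only the execution path differs
all.equal(res_bare, res_wrapped)
```

Comparing the two verbose printouts should show which form gets rewritten to the grouped implementation and which is evaluated group by group in plain R.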

Once it is available, something like DT[, mu := mean(x), by=g][, v := x - mu] should work (with lapply and Map inserted when applying to multiple columns).

In the meantime, there may be some speedup from

mDT = DT[, .(mu = mean(x)), by=g]
DT[mDT, on=.(g), mu := i.mu]
DT[, v := x - mu]

... though I'm not sure, since this involves two group-by operations.
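For several columns at once, the same idea could look roughly like the sketch below. This is untested on your data; the column names (a, b), the mu_ and dev_ prefixes, and the toy table are all assumptions made for illustration. The group means are computed with a bare mean inside lapply(.SD, ...), so that step should be eligible for the optimization, and the means are then joined back by reference:

```r
library(data.table)

set.seed(1)
# Toy data (illustrative): 4 groups of 5 rows, two numeric columns
DT <- data.table(g = rep(1:4, each = 5), a = rnorm(20), b = runif(20))

cols     <- c("a", "b")            # columns to demean (hypothetical names)
mu_cols  <- paste0("mu_", cols)    # temporary columns holding group means
dev_cols <- paste0("dev_", cols)   # output columns holding deviations

# 1. Table of group means, using a bare `mean` in lapply(.SD, ...)
mDT <- DT[, lapply(.SD, mean), by = g, .SDcols = cols]

# 2. Update join: copy each group's means onto the matching rows of DT
DT[mDT, on = .(g), (mu_cols) := mget(paste0("i.", cols))]

# 3. Deviation from the group mean, column by column
DT[, (dev_cols) := Map(`-`, mget(cols), mget(mu_cols))]

# 4. Drop the temporary mean columns
DT[, (mu_cols) := NULL]
```

Whether this beats the plain lapply(.SD, function(x) x - mean(x)) version will depend on the number of groups and columns, so it is worth benchmarking on your own data.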