Perform and write functions on data using `by` in data.table

Question

I am using data.table in R and trying to write functions that do calculations per group (DT[i, j, by = ...]), but within those functions I also need to compute on the whole dataset. As an example, taking the iris data, I can do the following to get the difference between group and overall means ("deviations"):

library(data.table)
dtIris <- data.table(iris)

# Sample means by group
dtIris[, mean(Petal.Length), by = "Species"]

# Overall sample mean
dtIris[, mean(Petal.Length)]

# Group deviations 
dtIris[, mean(Petal.Length), by = "Species"][, V1] - dtIris[, mean(Petal.Length)]

Alternatively I can make this a little more elegant with an aggregate() to get it into one expression:

# Within a single expression 
dtIris[, aggregate(Petal.Length ~ Species, FUN = mean)[,2] - mean(Petal.Length)]

And popping that into a function:

# Create function
dtDeviations <- function(x, by){
  aggregate(x ~ by, FUN = mean)[,2] - mean(x)
}
dtIris[, dtDeviations(Petal.Length, Species)]

My question: is there a way to make this fit the "data.table way", so that my function interacts with the by argument in data.table notation and gets the means both before and after grouping? That would mean I could do the above by executing:

dtIris[, dtDeviations(Petal.Length), by = "Species"]

One possible solution would be to have the group means repeated by the length of each group, with a mean of that vector being the overall mean. It seems reasonable that there would be a way to access and act upon the grouped values within the function. This would be akin to

# Reconstructed overall mean
dtIris[, rep(mean(Petal.Length), .N), by = "Species"][, mean(V1)]

When you use by, j will only see the subset of dtIris corresponding to each by group. You will need to refer to dtIris itself to see the whole Petal.Length vector — chinsoon12
yes, but in principle it could also see the three group means and counts for each group, allowing the overall mean to be reconstructed, for example: x1 <- rnorm(112, 0, 1); x2 <- rnorm(481, 1, 1); mean(c(x1,x2)); mean(c(rep(mean(x1), length(x1)), rep(mean(x2), length(x2)))) — rg255
when calculating the mean for the first group, the means for the other 2 groups are not calculated yet. what is your use case for such a construct? there are quite a few good suggestions below to calculate at the global level before going down to the grouped level — chinsoon12
The examples in my question also show that I can do this at the global level first, but I want to know if it is possible (and how) to do it inline with the standard dt[i,j,by] syntax... which is what the question says — rg255
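The point made in the comments, that j only sees the current group while the full table is still reachable by name, can be sketched as below. This is only an illustration under that assumption; deviation_from is a hypothetical helper name, not part of data.table.

```r
library(data.table)
dtIris <- data.table(iris)

# Hypothetical helper: deviation of a group's mean from a reference mean
deviation_from <- function(x, reference) mean(x) - mean(reference)

# Inside j, Petal.Length is the current group's subset;
# dtIris$Petal.Length is the full column, looked up outside the grouping
devs <- dtIris[, .(deviation = deviation_from(Petal.Length, dtIris$Petal.Length)),
               by = Species]
devs
```

This keeps the dt[i, j, by] form, at the cost of naming the table a second time inside j.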

sindri_baldur · Accepted Answer · 2020-04-27T12:58:15

Not sure if you'll find this more elegant but it's another option:

dtIris[, .(sum(Petal.Length), .N), by = "Species"
       ][, V1/N - sum(V1) / sum(N)]
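In the chained call above, V1/N is each group's mean and sum(V1) / sum(N) rebuilds the overall mean from the per-group sums and counts. A minimal sketch of wrapping that idea in a reusable function might look like this; group_deviations, value_col and group_col are hypothetical names introduced here for illustration.

```r
library(data.table)
dtIris <- data.table(iris)

# Hypothetical helper: per-group sums and counts first, then deviations
group_deviations <- function(dt, value_col, group_col) {
  # One row per group: total of the value column and group size
  agg <- dt[, .(total = sum(.SD[[1L]]), n = .N),
            by = group_col, .SDcols = value_col]
  # Group mean minus the overall mean rebuilt from the totals and counts
  agg[, deviation := total / n - sum(total) / sum(n)]
  agg[, c(group_col, "deviation"), with = FALSE]
}

res <- group_deviations(dtIris, "Petal.Length", "Species")
res
```

Because only sums and counts are carried out of the grouped step, the overall mean never requires a second pass over the raw column.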
