I am using data.table in R and want to write functions that do calculations per group (DT[i, j, by = ...]) but that also need to operate on the whole dataset from within the function. As an example, taking the iris data, I can do the following to get the difference between each group mean and the overall mean ("deviations"):
library(data.table)
dtIris <- data.table(iris)
# Sample means by group
dtIris[, mean(Petal.Length), by = "Species"]
# Overall sample mean
dtIris[, mean(Petal.Length)]
# Group deviations
dtIris[, mean(Petal.Length), by = "Species"][, V1] - dtIris[, mean(Petal.Length)]
Alternatively, I can make this a little more elegant by using aggregate() to get it into one expression:
# Within a single expression
dtIris[, aggregate(Petal.Length ~ Species, FUN = mean)[,2] - mean(Petal.Length)]
And putting that into a function:
# Create function
dtDeviations <- function(x, by) {
  # Group means (second column of the aggregate result) minus the overall mean
  aggregate(x ~ by, FUN = mean)[, 2] - mean(x)
}
dtIris[, dtDeviations(Petal.Length, Species)]
My question is: is there a way to make this fit the "data.table way", so that my function interacts with the by argument in the data.table notation and can get the means both before and after grouping? This would mean I could do the above by executing:
dtIris[, dtDeviations(Petal.Length), by = "Species"]
One possible solution would be to repeat each group mean by the length of its group; the mean of that vector is then the overall mean. It seems reasonable that there would be a way to access and act upon the grouped values within the function. This would be akin to:
# Reconstructed overall mean
dtIris[, rep(mean(Petal.Length), .N), by = "Species"][, mean(V1)]
by, j will only see each subset of dtIris corresponding to that by. You will need to refer to dtIris to see the whole Petal.Length vector - chinsoon12
x1 <- rnorm(112, 0, 1); x2 <- rnorm(481, 1, 1); mean(c(x1,x2)); mean(c(rep(mean(x1), length(x1)), rep(mean(x2), length(x2)))) - rg255
dt[i,j,by] syntax... which is what the question says - rg255
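To make chinsoon12's comment concrete, here is a minimal sketch of one way to keep the DT[i, j, by] form while still using the overall mean. It assumes the overall mean is either computed once beforehand or obtained by referring to dtIris by name inside j. The names dtDeviations2, overallMean, devs and devsInline are hypothetical and not part of the question.

```r
library(data.table)
dtIris <- data.table(iris)

# Hypothetical helper (not from the question): the overall mean is passed in
# explicitly because, under by, x holds only the current group's values.
dtDeviations2 <- function(x, overall) mean(x) - overall

# Compute the overall mean once, outside the grouped call
overallMean <- dtIris[, mean(Petal.Length)]

# Grouped call in the DT[i, j, by] form, one deviation per Species
devs <- dtIris[, .(deviation = dtDeviations2(Petal.Length, overallMean)),
               by = "Species"]

# Same calculation written inline, referring to dtIris by name inside j
devsInline <- dtIris[, .(deviation = mean(Petal.Length) - dtIris[, mean(Petal.Length)]),
                     by = "Species"]
```

Passing the overall mean as an argument keeps the function free of any dependence on a particular table name. The inline version is shorter but recomputes the overall mean once per group.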