I want to take a bunch of descriptive statistics grouped by several hundred grouping vars. I know from How to group data.table by multiple columns? that I can use list( ) in the grouping parameter if I want the stat for a combination of grouping vars. In my case I want the mean for each level of Y than the mean for each level of Z
# example data
set.seed(007)
DF <- data.frame(X=1:50000, Y=sample(c(0,1), 50000, TRUE), Z=sample(0:5, 50000, TRUE))
library(data.table)
DT <- data.table(DF)
# I tried this - but this gives the mean for each combination of Y and Z
DT[, mean(X), by=list(Y, Z)]
# so does this
DT[, mean(X), by=c("Y", "Z")]
# This works but....
out <- lapply( c( "Y","Z") , FUN= function(K){ DT[, mean(X), by=get(K)]})
out <- do.call( rbind, out )
#...but it is really slow.
I have 100 million records and 400+ grouping vars so need something - somewhat efficient. The lapply option adds up to several days of extra processing time
options( digits=15 )
start.time <- Sys.time()
out <- lapply( c( "Y","Z") , FUN= function(K){ DT[, mean(X), by=get(K)]})
end.time <- Sys.time()
time.taken <- end.time - start.time
start.time <- Sys.time()
DT[, mean(X), by=c("Y")]
DT[, mean(X), by=c("Z")]
end.time <- Sys.time()
time.taken2 <- end.time - start.time
time.taken - time.taken2