You mention that your data has dimensions of 20 million rows by 12 columns, but not the number of unique values of "ID". I'll assume 20,000 here.
If you're looking for your solution to be both 1) fast and 2) memory-efficient, then Matthew's (or Jeremy's) approach of spelling out all the variables will perform better - that is, until unlist(.SD) is optimised. Basically, what would be ideal is @codoremifa's syntax with the performance of @Matthew's.
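For concreteness, the two styles look roughly like this (a sketch only - I'm assuming the columns V1..V11 from the benchmark below, and that @codoremifa's syntax is the unlist(.SD) idiom mentioned above):

# spelled out (Matthew's / Jeremy's style) - faster as of now:
DT[, list(val = mean(c(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11))), by = ID]
# concise .SD style (@codoremifa's) - slower until unlist(.SD) is optimised:
DT[, list(val = mean(unlist(.SD))), by = ID]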
The purpose of this post is to illustrate the performance gain you can get by running setkey on a data.table of such huge dimensions before aggregating - an aspect that hasn't been covered by the other answers at the time of writing.
setkey is normally used because it's required for joins or fast subsets (based on binary search), yes. But on data of dimensions such as yours (which is, it's safe to say, BIG data), you can benefit a LOT from setting the key. This is because setkey sorts the data by the key column, which places the rows of each group in contiguous memory locations, so the later aggregation is very cache-efficient.
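As a quick aside on the binary-search point, here's a minimal sketch on a toy table (separate from the benchmark data below):

require(data.table)
dt <- data.table(ID = c(3L, 1L, 2L, 1L), val = 1:4)
setkey(dt, ID) # sorts dt by ID (by reference) and marks ID as the key
key(dt)        # [1] "ID"
dt[J(1L)]      # keyed subset via binary search on ID
dt[ID == 1L]   # equivalent vector scan, which doesn't need a key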
There are quite a lot of enhancements in v1.8.11 (the current development version), where setkey has also gotten a lot faster. So the benchmarks shown here will differ on the current stable version, 1.8.10, on CRAN. It's okay if you're not using the development version; hopefully this will convince you of the usefulness of setkey and give you an idea of what to expect in the next release.
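If you're not sure which version you're running:

packageVersion("data.table") # e.g. '1.8.10', the CRAN version at the time of writing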
Okay, on to the illustration, using data of your dimensions:
Getting some data:
require(data.table)
set.seed(1L)
uval <- 2e4 # number of unique values in ID
N <- 20e6   # number of rows
DT <- data.table(ID = sample(uval, N, TRUE)) # for simplicity, ID is an integer
cols <- paste("V", 1:11, sep = "")
set(DT, i = NULL, j = cols, value = as.list(1:11)) # add 11 integer columns V1..V11
dim(DT) # 20e6 by 12
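A quick sanity check on the generated data (the commented values are what I'd expect, not re-run output):

length(unique(DT$ID)) # should be 20000 - every ID appears when sampling 20e6 draws
key(DT)               # NULL - no key set yet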
Without setting key:
system.time(ans1 <- DT[,
list(val=mean(c(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11))),
by=ID])
# user system elapsed
# 45.587 0.632 46.251
By setting key:
system.time(setkey(DT, ID)) # (note that this'll be much faster on 1.8.11)
# user system elapsed
# 5.872 0.072 5.948
system.time(ans2 <- DT[,
list(val=mean(c(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11))),
by=ID])
# user system elapsed
# 2.164 0.236 2.400
setkey(ans1, ID) # sort ans1 (and set its key) so it matches ans2
identical(ans1, ans2) # [1] TRUE
You see that, by setting the key, you take ~8.4 seconds in total (5.9 for setkey + 2.4 for the aggregation), whereas without it, it takes over 46 seconds. That's roughly a 5.5x speed-up.