9 votes

I want to use data.table to achieve a very simple task for a large dataset.

Calculate mean of val1 and val2 for each ID.

For details, please refer to the fake data below.

library(data.table)
DT <- data.table(ID    = paste0("ID", rep(1:5, each = 2)),
                 level = rep(c("CTRL", "CTRL", "ID1", "ID2", "ID3"), 2),
                 val1  = 1:10,
                 val2  = rnorm(10))  # no seed set, so val2 values differ per run

Here I want to calculate, for each ID, the mean of val1 and val2.

Also notice that within each ID there are different levels. But for each unique ID, I just want one mean incorporating the different levels, val1, and val2.

ID  | Mean
ID1 | ...
ID2 | ...
ID3 | ...

I tried the following code, but it doesn't work.

topagents <- DT[, mean = mean(list(val1,val2)), 
                    by = ID]

I know how to do it in reshape2: first melt, then dcast.

But the original dataset is relatively large, with 20M rows and 12 fields, so it takes quite a long time to do the calculation that way.

So I prefer to use data.table or dplyr.


4 Answers

3 votes

If I understand correctly, you have twelve fields and wish to keep hardcoding to a minimum. I'm not quite sure what your intended output is, but hopefully it's one of the two results below.

colstomean <- setdiff(colnames(DT), c('ID', 'level'))

Option 1: the mean of each variable within each ID

DT[, lapply(.SD, mean, na.rm=TRUE), 
   by=ID, 
   .SDcols = colstomean
   ]

Output:

    ID val1        val2
1: ID1  1.5  0.37648090
2: ID2  3.5 -0.55484848
3: ID3  5.5 -0.07326365
4: ID4  7.5 -0.37705525
5: ID5  9.5 -0.08075406

Option 2: one mean of all variable values pooled within each ID

DT[, mean(unlist(.SD), na.rm = TRUE),
   by = ID,
   .SDcols = colstomean
   ]

Output:

    ID        V1
1: ID1 0.9382404
2: ID2 1.4725758
3: ID3 2.7133682
4: ID4 3.5614724
5: ID5 4.7096230
7 votes

Encapsulate the calls to mean inside the list, rather than taking the mean of a list, which you can't do:

DT[, j=list(val1=mean(val1), val2=mean(val2)), by=ID]
    ID val1       val2
1: ID1  1.5  0.1389794
2: ID2  3.5  0.3392179
3: ID3  5.5 -0.6336174
4: ID4  7.5  0.9941148
5: ID5  9.5  0.1324782

To get a single value, the mean of the val1 and val2 values, combine these and pass to mean:

DT[, j=list(mean=mean(c(val1,val2))), by=ID]
    ID      mean
1: ID1 0.8194897
2: ID2 1.9196090
3: ID3 2.4331913
4: ID4 4.2470574
5: ID5 4.8162391

Using a list for the single element of j here is an easy way to name the resulting column.

5 votes
topagents <- DT[, mean(c(val1,val2)), by = ID]

mean can only take a vector; it doesn't understand a list.

Your question said "Calculate mean of val1 and val2 for each ID." But based on Matthew's answer, maybe you wanted "Calculate means (plural) of val1 and val2 for each ID"?

5 votes

You mention that your data is 20 million rows by 12 columns, but you don't mention the number of unique values of "ID". I'm going to assume 20,000 here.

If you're looking for your solution to be both 1) fast and 2) memory-efficient, then Matthew's (or Jeremy's) approach of spelling out all the variables will perform better, at least until unlist(.SD) is optimised. Basically, what would be ideal is @codoremifa's syntax with @Matthew's performance.

The purpose of this post is to illustrate the performance gain you can get by calling setkey on a data.table of such huge dimensions before aggregating (an aspect that hadn't been covered by the other answers at the time of writing).

Yes, setkey is normally used because it's required for joins or fast subsetting (based on binary search). But on data of your dimensions (which is, it's safe to say, BIG data), you can benefit a lot from setting the key: setkey sorts the data by the key column, so the columns to be aggregated later lie in contiguous memory locations for each group, which makes the aggregation very efficient.

There are quite a lot of enhancements in v1.8.11 (the current development version), where setkey has also gotten a lot faster, so the benchmarks shown here will differ from the current stable version 1.8.10 on CRAN. It's okay if you're not using the development version; hopefully this will convince you of the usefulness of setkey and give you a sense of what to expect in the next release.


Okay, on to an illustration on data of your dimensions.

Getting some data:

require(data.table)
set.seed(1L)
uval <- 2e4   # number of unique values in ID
N <- 20e6     # number of rows
DT <- data.table(ID = sample(uval, N, TRUE))  # for simplicity, ID is integer
cols <- paste("V", 1:11, sep = "")
set(DT, i = NULL, j = cols, value = as.list(1:11))  # add 11 value columns V1..V11
dim(DT)  # 20e6 by 12

Without setting key:

system.time(ans1 <- DT[,
    list(val = mean(c(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11))),
    by = ID])
#   user  system elapsed
# 45.587   0.632  46.251

By setting key:

system.time(setkey(DT, ID)) # (note that this'll be much faster on 1.8.11)
#   user  system elapsed
#  5.872   0.072   5.948

system.time(ans2 <- DT[,
    list(val = mean(c(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11))),
    by = ID])
#   user  system elapsed
#  2.164   0.236   2.400

setkey(ans1, ID)
identical(ans1, ans2) # [1] TRUE

You can see that by setting the key you take ~8.4 seconds in total (≈5.9 sec for setkey + ≈2.4 sec for the aggregation), whereas without the key the aggregation alone takes over 46 seconds. That's a lot of speed-up.