I'd like to compute the mean and count number of NA across several group combination with a large dataset. This is probably easiest to explain with some test data. I'm using the latest version of R on a Macbook Pro, and the data.table package (data is large, >1M rows). (note: I noticed after posting this that I accidentally used sum() instead of mean() for the "m = " variables below. I haven't edited it because I don't want to re-run everything, and don't think it matters that much)
set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
ID2 = data.table(id2 = 20001:20050)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
DT <- DT[,as.list(ID2), by = .(yr, id)] # intentional cartesian join
rm("YR","ID","ID2")
# 2.7M obs, now add data
DT[,`:=` (ratio = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio %% 5) == 0, ratio:=NA] # make some of the ratios NA
DT[,`:=` (keep = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
# do it again
DT[,`:=` (ratio2 = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio2 %% 4) == 0, ratio2:=NA] # make some of the ratios NA
DT[,`:=` (keep2 = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
So, what I have is identifying info (yr, id, id2) and the data I want to summarize: keep1|2, ratio1|2. Specifically by yr-id, I want to compute the average ratio and ratio2 using keep and keep2 (thus compressing id2). I've thought of doing this either by subsetting by keep/keep2 the computing ratio and ratio2 or by matrix multiplication of keep*ratio, keep2*ratio, keep*ratio2 and keep2*ratio2.
First, the way I'm doing this that gets the right answer, but is slow:
system.time(test1 <- DT[,.SD[keep == 1,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.083 0.191 23.319
This works just as well in about the same time. I thought it might be faster to subset the main data first rather than within .SD:
system.time(test2 <- DT[keep == 1,.SD[,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.723 0.208 23.963
The problem with either of these approaches is that I need to do separate computations for each keep variable. Thus I tried this way:
system.time(test3 <- DT[,.SD[,.( m = sum(ratio*keep,na.rm = TRUE),
nmiss = sum(is.na(ratio*keep)) )
],by=.(yr,id)])
user system elapsed
25.997 0.191 26.217
This allows me to put all the formulas together (I could add in ratio*keep2, ratio2*keep and ratio2*keep2) but 1. it is slower and 2. it is not getting the correct number of NAs (see the nmiss column):
> summary(test1)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test2)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test3)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. : 0.00
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.: 2.00
Median :1988 Median :10500 Median : 53.828 Median : 4.00
Mean :1988 Mean :10500 Mean : 59.653 Mean : 4.99
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.: 8.00
Max. :2015 Max. :11000 Max. :211.552 Max. :20.00
What is the fastest way to get my 4 combinations of summarized info by yr-id? Right now, I'm using option 1 or 2 repeated twice (once for keep, again for keep2)