
I'd like to compute the mean and count the number of NAs across several group combinations in a large dataset (>1M rows). This is probably easiest to explain with some test data. I'm using the latest version of R on a MacBook Pro, with the data.table package. (Note: I noticed after posting this that I accidentally used sum() instead of mean() for the "m =" variables below. I haven't edited it because I don't want to re-run everything, and I don't think it matters much here.)

set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
ID2 = data.table(id2 = 20001:20050)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
DT <- DT[,as.list(ID2), by = .(yr, id)] # intentional cartesian join
rm("YR","ID","ID2")
# 2.7M obs, now add data
DT[,`:=` (ratio = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio %% 5) == 0, ratio:=NA] # make some of the ratios NA
DT[,`:=` (keep = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
# do it again
DT[,`:=` (ratio2 = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio2 %% 4) == 0, ratio2:=NA] # make some of the ratios NA
DT[,`:=` (keep2 = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
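
As an aside (not part of the original post), data.table also ships a CJ() helper that builds the same cross-joined grid directly, without the two intermediate joins; a minimal sketch:

```r
library(data.table)

# same 54 x 1000 x 50 grid as the two cartesian joins above
DT2 <- CJ(yr = 1962:2015, id = 10001:11000, id2 = 20001:20050)
nrow(DT2)  # 2,700,000 rows
```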

So, what I have is identifying info (yr, id, id2) and the data I want to summarize: keep/keep2 and ratio/ratio2. Specifically, by yr-id I want to compute the average of ratio and ratio2 using keep and keep2 as indicators (thus collapsing over id2). I've thought of doing this either by subsetting on keep/keep2 and then computing ratio and ratio2, or by elementwise multiplication: keep*ratio, keep2*ratio, keep*ratio2 and keep2*ratio2.

First, the way I'm doing this that gets the right answer, but is slow:

system.time(test1 <- DT[,.SD[keep == 1,.(m = sum(ratio,na.rm = TRUE), 
                                nmiss = sum(is.na(ratio)) )
                      ],by=.(yr,id)])
   user  system elapsed 
 23.083   0.191  23.319 

I thought it might be faster to subset the main data first rather than within .SD. This works just as well, in about the same time:

system.time(test2 <- DT[keep == 1,.SD[,.(m = sum(ratio,na.rm = TRUE), 
                                nmiss = sum(is.na(ratio)) )
                       ],by=.(yr,id)])
   user  system elapsed 
 23.723   0.208  23.963

The problem with either of these approaches is that I need to do separate computations for each keep variable. Thus I tried this way:

system.time(test3 <- DT[,.SD[,.( m = sum(ratio*keep,na.rm = TRUE),
                                 nmiss = sum(is.na(ratio*keep)) )
                       ],by=.(yr,id)])
   user  system elapsed 
 25.997   0.191  26.217 

This allows me to put all the formulas together (I could add in ratio*keep2, ratio2*keep and ratio2*keep2), but (1) it is slower and (2) it does not get the correct number of NAs (see the nmiss column):

> summary(test1)
       yr             id              m               nmiss      
 Min.   :1962   Min.   :10001   Min.   : -2.154   Min.   :0.000  
 1st Qu.:1975   1st Qu.:10251   1st Qu.: 30.925   1st Qu.:0.000  
 Median :1988   Median :10500   Median : 53.828   Median :1.000  
 Mean   :1988   Mean   :10500   Mean   : 59.653   Mean   :1.207  
 3rd Qu.:2002   3rd Qu.:10750   3rd Qu.: 85.550   3rd Qu.:2.000  
 Max.   :2015   Max.   :11000   Max.   :211.552   Max.   :9.000  
> summary(test2)
       yr             id              m               nmiss      
 Min.   :1962   Min.   :10001   Min.   : -2.154   Min.   :0.000  
 1st Qu.:1975   1st Qu.:10251   1st Qu.: 30.925   1st Qu.:0.000  
 Median :1988   Median :10500   Median : 53.828   Median :1.000  
 Mean   :1988   Mean   :10500   Mean   : 59.653   Mean   :1.207  
 3rd Qu.:2002   3rd Qu.:10750   3rd Qu.: 85.550   3rd Qu.:2.000  
 Max.   :2015   Max.   :11000   Max.   :211.552   Max.   :9.000  
> summary(test3)
       yr             id              m               nmiss      
 Min.   :1962   Min.   :10001   Min.   : -2.154   Min.   : 0.00  
 1st Qu.:1975   1st Qu.:10251   1st Qu.: 30.925   1st Qu.: 2.00  
 Median :1988   Median :10500   Median : 53.828   Median : 4.00  
 Mean   :1988   Mean   :10500   Mean   : 59.653   Mean   : 4.99  
 3rd Qu.:2002   3rd Qu.:10750   3rd Qu.: 85.550   3rd Qu.: 8.00  
 Max.   :2015   Max.   :11000   Max.   :211.552   Max.   :20.00  
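
(The nmiss discrepancy in test3 can be seen on a toy vector, since NA propagates through multiplication even where the indicator is 0; a minimal base-R illustration with made-up values:)

```r
# toy vectors standing in for one yr-id group's ratio and keep columns
r <- c(1.5, NA, 2.0, NA)
k <- c(1L,  0L, 1L,  1L)

sum(is.na(r[k == 1]))  # NAs among kept rows only: 1
sum(is.na(r * k))      # NA propagates through * regardless of k: 2
```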

What is the fastest way to get my 4 combinations of summarized info by yr-id? Right now, I'm using approach 1 or 2 repeated twice (once for keep, again for keep2).

1 Answer


You can do the summarization directly in the expression in j:

# solution A: summarize in `.SD`:
system.time({
    test2 <- DT[keep == 1,
                .SD[, .(m = sum(ratio, na.rm = TRUE),
                        nmiss = sum(is.na(ratio)))],
                by = .(yr, id), verbose = T]
})
#    user  system elapsed 
#  22.359   0.439  22.561

# solution B: summarize directly in j:
system.time({
    test2 <- DT[keep == 1, .(m = sum(ratio, na.rm = T),
                             nmiss = sum(is.na(ratio))),
                by = .(yr, id), verbose = T]
})
#    user  system elapsed 
#   0.118   0.077   0.195

verbose = T is added to show the difference between the two approaches:

For solution A:

lapply optimization is on, j unchanged as '.SD[, list(m = sum(ratio, na.rm = TRUE), nmiss = sum(is.na(ratio)))]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
collecting discontiguous groups took 0.058s for 54000 groups
eval(j) took 22.487s for 54000 calls
22.521 secs

For solution B:

...
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as 'list(sum(ratio, na.rm = T), sum(is.na(ratio)))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 0.027s for 54000 groups
eval(j) took 0.079s for 54000 calls
0.168 secs

The main difference is that in solution A, j evaluates a full `.SD[...]` subset call per group, and its result is treated as a named list; recreating those names over and over is extremely slow when there are many groups (54k groups for this data!). For a similar benchmark of this type see this one.

For the second part (your test3): the rows were not filtered by keep == 1 first, so ratio*keep is still NA wherever ratio is NA, even in rows where keep != 1. Those NAs are also counted in nmiss, which is why the counts differ.
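
If the goal is all four combinations in a single grouped pass, one option is to subset inside j per combination, so NAs are counted only among the rows each indicator actually keeps. A sketch below uses a tiny made-up stand-in for the question's DT so it is runnable as-is; note that expressions like ratio[keep == 1] in j are not GForce-optimized, so it is worth benchmarking this against running solution B twice on your real data:

```r
library(data.table)

# toy stand-in for the question's DT (same columns, made-up values)
DT <- data.table(yr     = rep(1962L, 4),
                 id     = rep(10001L, 4),
                 ratio  = c(1.0, NA, 3.0, NA),
                 ratio2 = c(NA, 2.0, 3.0, 4.0),
                 keep   = c(1L, 1L, 0L, 0L),
                 keep2  = c(0L, 1L, 1L, 1L))

# all four keep x ratio combinations in one grouped pass; the subset in each
# expression restricts both the sum and the NA count to the kept rows
test4 <- DT[, .(m_k1r1  = sum(ratio [keep  == 1], na.rm = TRUE),
                na_k1r1 = sum(is.na(ratio [keep  == 1])),
                m_k1r2  = sum(ratio2[keep  == 1], na.rm = TRUE),
                na_k1r2 = sum(is.na(ratio2[keep  == 1])),
                m_k2r1  = sum(ratio [keep2 == 1], na.rm = TRUE),
                na_k2r1 = sum(is.na(ratio [keep2 == 1])),
                m_k2r2  = sum(ratio2[keep2 == 1], na.rm = TRUE),
                na_k2r2 = sum(is.na(ratio2[keep2 == 1]))),
            by = .(yr, id)]
```

(Per the note in the question, swap sum() for mean() in the m_* columns if the averages are what you ultimately want.)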