13
votes

I have panel data (subject/year) for which I would like to only keep subjects who appear the maximum number of times per year. The data set is large so I am using the data.table package. Is there a more elegant solution than what I have tried below?

library(data.table)

DT <- data.table(SUBJECT=c(rep('John',3), rep('Paul',2), 
                           rep('George',3), rep('Ringo',2), 
                           rep('John',2), rep('Paul',4), 
                           rep('George',2), rep('Ringo',4)), 
                 YEAR=c(rep(2011,10), rep(2012,12)), 
                 HEIGHT=rnorm(22), 
                 WEIGHT=rnorm(22))
DT

DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, MAXCOUNT := max(COUNT), by='YEAR']

DT <- DT[COUNT==MAXCOUNT]
DT <- DT[, c('COUNT','MAXCOUNT') := NULL]
DT
1
So essentially you want a data.table with all data from the most data-rich year for each beatle? - Señor O
Come to think of it, it'd be nice if data.table had some of the same by capabilities in the i expression as it does in j expressions. - Señor O

1 Answers

15
votes

I'm not sure you'll view this as elegant but how about :

DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, .SD[COUNT == max(COUNT)], by='YEAR']

That's essentially how to apply by to the i expression as @SenorO commented. You'll still need [,COUNT:=NULL] afterwards but for one temporary column rather than two.

We do discourage .SD though for speed reasons, but hopefully we'll get to this feature request soon so that advice can be dropped: FR#2330 Optimize .SD[i] query to keep the elegance but make it faster unchanged..

A different approach is as follows. It's faster and idiomatic but may be considered less elegant.

# Create a small aggregate table first. No need to use := on the big table.
i = DT[, .N, by='SUBJECT,YEAR']

# Find the even smaller subset. (Do as much as we can on the small aggregate.)
i = i[, .SD[N==max(N)], by=YEAR]

# Finally join the small subset of key values to the big table
setkey(DT, YEAR, SUBJECT)
DT[i]

Something similar is here.