Grouping and conditions without loop (big data)

Question

I have several observations of the same groups, and for each observation I have a year.

dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))

group   year
1   a   2000
2   a   1996
3   a   1975
4   b   2002
5   b   2010
6   b   1980
7   c   1990
8   c   1986
9   c   1995

For each observation, I would like to know whether another observation of the same group satisfies given conditions relative to the focal observation, e.g.: "Is there any other observation (apart from the focal one) in the same group that was made during the 6 years up to the focal year?"

Ideally the data frame should look like this:

group   year  six_years
1   a   2000          1  # group a has another observation in 1996, which falls inside the window 1994 to 2000 (2000 - 6 = 1994)
2   a   1996          0
3   a   1975          0
4   b   2002          0
5   b   2010          0
6   b   1980          0
7   c   1990          1
8   c   1986          0
9   c   1995          1

Basically, for each row we should look at the subset for its group and check whether any(dat$year == conditions) holds. This is very easy to do with a for loop, but that is of no use here: the data frame is massive (several million rows) and a loop would take forever. I am looking for an efficient way using vectorized functions or a fast package.
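For reference, the row-by-row check described above could be sketched as follows. This is only a slow baseline for verifying a faster solution on small data, not something to run on millions of rows. The helper name has_recent and its window argument are made up for illustration, and it assumes the window includes the focal year itself.

```r
dat <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))

# Hypothetical helper: TRUE if any row other than row i is in the same group
# and has a year between year[i] - window and year[i] (both ends included).
has_recent <- function(i, df, window = 6) {
  others <- df[-i, ]
  any(others$group == df$group[i] &
      others$year >= df$year[i] - window &
      others$year <= df$year[i])
}

# One pass over the rows; cost grows quadratically with nrow(dat).
dat$six_years <- as.integer(vapply(seq_len(nrow(dat)), has_recent,
                                   logical(1), df = dat))
dat$six_years
# 1 0 0 0 0 0 1 0 1
```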

Thanks !

Troy · Accepted Answer · 2013-12-03T06:02:45

EDITED

Actually, thinking about it, you will probably have a lot of recurring year/group combinations, in which case it is much quicker to pre-calculate the frequencies using count(), which is also a plyr function:

9M rows took ~4 sec

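library(plyr)

# 9 million rows drawn from 3 groups and 9 distinct years
dat <- data.frame(group = sample(c("a","b","c"), size = 9000000, replace = TRUE),
                  year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995), size = 9000000, replace = TRUE))

# number of distinct group/year combinations in group g whose year
# falls in the 6 years before y (y itself is excluded)
test <- function(y, g, df) {
  d <- df[df$year >= y - 6 &
          df$year < y &
          df$group == g, ]
  return(nrow(d))
}

rollup <- function() {
  summ <- count(dat)                   # add a frequency to each combination
  # t > 0 means the combination has an earlier year within the 6-year window
  return(ddply(summ, .(group, year), transform, t = test(as.numeric(year), group, summ) * freq))
}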

system.time(rollup())

user  system elapsed 
3.44    0.42    3.90
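rollup() returns one row per group/year combination with a count in t, not the 0/1 six_years column on the original rows. A minimal base R sketch of one way to get that flag without plyr is below. It is an alternative to the code above, not the same method: it sorts the distinct combinations and looks at the gap to the previous year in the same group. It assumes, from the wording of the question, that a second observation in the same group and year also counts as a match, which test() above does not do. All variable names are made up for illustration.

```r
dat <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))

# Distinct group/year combinations, sorted by group and then year.
combos <- unique(dat[, c("group", "year")])
combos <- combos[order(combos$group, combos$year), ]
n <- nrow(combos)

# Gap to the previous distinct year; only meaningful inside the same group.
same_group <- c(FALSE, combos$group[-1] == combos$group[-n])
gap <- c(NA, diff(combos$year))
recent <- same_group & !is.na(gap) & gap <= 6

# Map the flag back to the original rows. Rows whose group/year pair occurs
# more than once are also flagged (assumption: same-year duplicates count).
key <- paste(dat$group, dat$year)
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
idx <- match(key, paste(combos$group, combos$year))
dat$six_years <- as.integer(recent[idx] | dup)
dat$six_years
# 1 0 0 0 0 0 1 0 1
```

Everything here is a sort, a diff and a match, so there is no per-row subsetting. I have not timed it on the 9M-row data.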
