0
votes

I have several observations of the same groups, and for each observation I have a year.

dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))

group   year
1   a   2000
2   a   1996
3   a   1975
4   b   2002
5   b   2010
6   b   1980
7   c   1990
8   c   1986
9   c   1995

For each observation, i would like to know if another observation of the same group can be found with given conditions relative to the focal observation. e.g. : "Is there any other observation (than the focal one) that has been done during the last 6 years (starting from the focal year) in the same group".

Ideally the dataframe should be like that

group   year  six_years
1   a   2000          1  # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2   a   1996          0
3   a   1975          0
4   b   2002          0
5   b   2010          0
6   b   1980          0
7   c   1990          1
8   c   1986          0
9   c   1995          1

Basically for each row we should look into the subset of groups, and see if any(dat$year == conditions). It is very easy to do with a for loop, but it's of no use here : the dataframe is massive (several millions of row) and a loop would take forever. I am searching for an efficient way with vectorized functions or a fast package.

Thanks !

3

3 Answers

0
votes

EDITED

Actually thinking about it you will probably have a lot of recurring year/group combinations, in which case much quicker to pre-calculate the frequencies using count() - which is also a plyr function:

90M rows took ~4sec

require(plyr)
dat <- data.frame(group = sample(c("a","b","c"),size=9000000,replace=TRUE), 
             year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),size=9000000,replace=TRUE))


 test<-function(y,g,df){
 d<-df[df$year>=y-6 & 
          df$year<y &
          df$group== g,]
 return(nrow(d))
}

rollup<-function(){
  summ<-count(dat)                   # add a frequency to each combination
  return(ddply(summ,.(group,year),transform,t=test(as.numeric(year),group,summ)*freq))
}

system.time(rollup())

user  system elapsed 
3.44    0.42    3.90 
0
votes

My dataset had too many different groups, and the plyr option proposed by Troy was too slow. I found a hack (experts would probably say "an ugly one") with package data.table : the idea is to merge the data.table with itself quickly with the fast merge function. It gives every possible combination between a given year of a group and all others years from the same group. Then proceed with an ifelse for every row with the condition you're looking for. Finally, aggregate everything with a sum function to know how many times every given years can be found in a given timespan relative to another year. On my computer, it took few milliseconds, instead of the probable hours that plyr was going to take

dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")

Produces this :

group   year
1   a   2000
2   a   1996
3   a   1975
4   b   2002
5   b   2010
6   b   1980
7   c   1990
8   c   1986
9   c   1995

Then :

z = merge(dat, dat, by = "group", all = T, allow.cartesian = T) # super fast

z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0) # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears) # we want to sum this up after
z$year.y = NULL # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]

(Years with another year of the same group in the last six years are given a "1" :

  group year x
1     a 1975 0
2     b 1980 0
3     c 1986 0
4     c 1990 1  # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5     c 1995 1  # <== this one. This one too has another reference in the last 6 years, two rows above.
6     a 1996 0
7     a 2000 1
8     b 2002 0
9     b 2010 0

Icing on the cake : it deals with NA seamlessly.

0
votes

Here's another possibility also using data.table but including diff().

dat <- data.table(group = rep(c("a","b","c"), each = 3), 
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), 
                  key = "group")
valid_case <- subset(dt[,list(valid_case = diff(year)), by=key(dt)], 
                     abs(valid_case)<6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)

I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs since they propagate in diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd have to think avoiding that all together helps. There's probably a more idiomatic way to do the condition in the ifelse statement using data.table joins. That could potentially speed things up, although my experience has never found %in% to be the limiting factor.