I am trying to test a variety of conditions on a data.table that looks like this reproducible example:

 library(data.table)
 set.seed(17)
 year <- 1980 + rnbinom(10000,3,0.35)
 event <- rep(LETTERS, length.out=10000)
 z <- as.integer(runif(10000,min = 0, max = 10))
 dt <- data.table(event,year,z)
 setkey(dt, event,year)
 # total count per event-year; the unnamed sum becomes column V1
 dt <- dt[,sum(z), by=c("event","year")]

V1 (which emerges from the last command) represents a count of event occurrences.

So the data table is an ordered array and I need to execute a variety of functions on it. Here are some examples:

  1. How do I calculate a rolling sum (or rolling mean) of the occurrences in the 10 prior years for each event? For A 1990 the desired output is 1,452 (the total for 1980 through 1989). For H 2012 the output is 11, because between 2002 and 2011 there are only 11 occurrences (3 in 2002, 3 in 2007, and 5 in 2010). For A 1983 the output is NA.

  2. How can I check whether an event occurs in at least 12 of the 15 prior years? For A 1997 the event occurred in more than 12 of the 15 prior years (1982 - 1996; it happened in every year except 1996), so the criterion is met. For A 2001, however, the event occurs in only 11 of the 15 prior years (1986 - 2000; it does not happen in 1996, 1998, 1999, and 2000), so the criterion is not met. The desired output is a discrete 1 (criterion met) or 0 (criterion not met).

Ideally the code would calculate both 1 and 2 not only for years that occur in the data.table but also for those between 1980 and 2013 that are missing. For K 2005, the outcome for Q1 is 25 (13 + 5 + 3 + 3 + 2) (thanks @Arun for pointing out the earlier error). For Q2, the event does not occur in 1999, 2000, 2001, 2003, and 2004, hence the criterion "in at least 12 out of 15 years" is not met. It is also possible that an event-year combination exists in the data.table but V1 has value 0 (see row 18, A 2001). Ideally, such zero occurrences would be treated as non-occurrences (e.g. by deleting all rows for which V1 is zero).

I know it's uncommon to post two questions but I feel they belong together and really relate to similar problems. Hope someone can make some suggestions.

Thanks a lot,

Simon

You are right. It seems that my set.seed() did not work for some reason. Damn that is annoying. I will have to update every value I mentioned :( - SJDS
@Arun yes that is correct. No idea where I got 10 from, that is not even an option if I had forgotten the first or last one. Apologies - SJDS
Hi, the reason why I expect A, 1983 to be NA is much simpler: I don't have 10 years' worth of data before 1983. My complete dataset starts in 1980, and I intend to use it to define which events have occurred frequently in the past. In the actual analysis I reduce my panel to 1990 - 2006. The years after are for out-of-sample prediction and the years before (1980-1989) are used to create such historical frequencies. - SJDS
Regarding e.g. Z, 2012 I understand the confusion. I have a dataset that actually goes to and includes 2013. But as the fake data show, the incidence of events decreases dramatically near the end. So in short Z, 2012 could exist in my data and requires a value, but I cannot go back further than 1980 hence every rolling mean before 1989 cannot have 10 years prior information. (Depending on the findings I will tweak this to 7 years or 8 or maybe 11)... Hope this clarifies it and thanks so much for your efforts! - SJDS

1 Answer

For your first question:

This gets the running sum for years that are not necessarily in the dataset as well (as you requested just underneath the two points). The idea is to first generate all combinations of event and year, even the ones that don't exist in the dataset. This can be done with the function CJ (for cross join), which creates every year for each event.

setkey(dt, event, year)
d1 = CJ(event=unique(dt$event), year=min(dt$year):max(dt$year))

Now, we join back with dt to fill the missing values for V1 with NA.

d1 = dt[d1]
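
To make the cross join and the fill-in join concrete, here is a minimal sketch on a tiny made-up table (the `toy` data and the names `grid` and `full` are illustrative assumptions, not part of the question's dataset). It should show that event-year pairs missing from the data come back with V1 = NA.

```r
library(data.table)

# Hypothetical toy data: "A" has no 1981 row, "B" has only a 1981 row
toy <- data.table(event = c("A", "A", "B"),
                  year  = c(1980L, 1982L, 1981L),
                  V1    = c(4L, 7L, 2L))
setkey(toy, event, year)

# Every event crossed with every year in the observed range (2 x 3 = 6 rows)
grid <- CJ(event = unique(toy$event), year = min(toy$year):max(toy$year))

# Keyed join: grid rows without a match in toy get V1 = NA
full <- toy[grid]
print(full)
```

The three original rows keep their counts; the three combinations that were absent (A 1981, B 1980, B 1982) appear with NA.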

Now we have a dataset with all combinations of event and year. Next we need a way to perform the rolling sum. For this, we create yet another dataset, which contains the previous 10 years for each year, as follows:

window_size = 10L
d2 = d1[, list(window = seq(year-window_size, year-1L, by=1L)), by="event,year"]

For each "event,year" group, the new column window holds the previous 10 years.
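
As a quick illustration of the window expansion, the sketch below uses a shortened window of 3 years on two made-up event-year rows (the `toy` and `win` names and the window length are assumptions chosen to keep the output readable).

```r
library(data.table)

window_size <- 3L   # shortened from 10L purely for readability
toy <- data.table(event = c("A", "B"), year = c(1990L, 1985L))

# One output row per (event, year, prior year inside the window)
win <- toy[, list(window = seq(year - window_size, year - 1L, by = 1L)),
           by = "event,year"]
print(win)
```

Each input row expands into `window_size` rows: A 1990 yields 1987, 1988, 1989 and B 1985 yields 1982, 1983, 1984.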

Now, all we've to do is to set the key columns appropriately and perform a join to get the corresponding "V1" values.

setkey(d2, event, window) ## note the join here is on "event, window"
setkey(d1, event, year)

ans = d1[d2]
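
The name given to the clashing year column in the join result may differ between data.table versions (the answer shows `year.1`). A hedged sketch of one way to sidestep this: rename the target year before joining, and use the `on=` argument available in more recent data.table releases instead of relying on keys. The tables `counts` and `win` and the column name `target` are made up for illustration.

```r
library(data.table)

# Hypothetical inputs: counts per event-year, and the window rows for A 1990
counts <- data.table(event = "A", year = 1987:1989, V1 = c(5L, NA, 2L))
win    <- data.table(event = "A", year = 1990L, window = 1987:1989)

# Rename the target year so it cannot clash with counts$year after the join
setnames(win, "year", "target")

# Match counts$event to win$event and counts$year to win$window
ans <- counts[win, on = c(event = "event", year = "window")]
print(ans)

# Aggregate the window values back to one row per event and target year
res <- ans[, list(total = sum(V1, na.rm = TRUE)), by = "event,target"]
print(res)
```

With an explicit `target` column, the later aggregation can group by "event,target" regardless of how a given version would have named the duplicate column.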

Now, we've the values of "V1" for each "event,window" combination. All we've to do is aggregate by "event,year.1" ("year.1" was previously "year", and "year" in ans was previously "window"). Here, we take care of the condition that if any of the years is < 1980, then the sum should be NA. This is done by using a small hack that TRUE | NA = TRUE and FALSE | NA = NA.

q1 = ans[, sum(V1, na.rm=TRUE) * (!any(year < 1980) | NA), by="event,year.1"]

q1[event == "K" & year.1 == 2005]
#    event year.1 V1
# 1:     K   2005 25
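
The NA trick can be checked in isolation. The sketch below wraps it in a small helper, `masked_sum`, which is a hypothetical name introduced here only for illustration; `first_year = 1980` mirrors the hard-coded value in the answer.

```r
# Three-valued logic in R: TRUE wins over NA, FALSE does not
TRUE  | NA   # TRUE
FALSE | NA   # NA

# Hypothetical helper: the sum of the window, or NA if the window
# reaches back before first_year
masked_sum <- function(values, years, first_year = 1980) {
  complete <- !any(years < first_year)
  sum(values, na.rm = TRUE) * (complete | NA)
}

masked_sum(c(3, NA, 4), 1980:1982)   # 7: the window lies fully inside the data
masked_sum(c(3, NA, 4), 1979:1981)   # NA: 1979 is before the first year
```

Multiplying by TRUE leaves the sum unchanged, while multiplying by NA turns it into NA, which is how incomplete windows are masked.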

For your second question:

Repeat the same steps as above with window_size = 15L instead of 10L, up to and including the computation of ans. Then we can do:

q2 = ans[!is.na(V1)][, .N, by="event,year.1"]

q2[event == "A" & year.1 == 1997]
#    event year.1  N
# 1:     A   1997 14

This is correct because dt has every year from 1982 to 1995 for A, while 1996 is missing and therefore not counted, so N = 14, as it should be. Since 14 >= 12, the criterion is met for A 1997.
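
The question also asks for a discrete 1/0 flag and for zero counts to be treated as non-occurrences. A possible way to get there is sketched below on made-up data, with a window of 4 and a threshold of 3 standing in for 15 and 12. The names `toy`, `targets`, `win`, `flag`, `min_years` and `first_year` are assumptions, and returning NA for windows that reach before the first year is also an assumption carried over from the first question. `merge` is used here instead of the keyed join only to keep the column names explicit.

```r
library(data.table)

# Hypothetical toy data: the 1983 and 1984 rows exist but have a count of zero
toy <- data.table(event = "A",
                  year  = 1980:1986,
                  V1    = c(2L, 1L, 3L, 0L, 0L, 4L, 1L))
toy <- toy[V1 > 0]          # treat zero counts as non-occurrences

window_size <- 4L           # stands in for 15L
min_years   <- 3L           # stands in for 12
first_year  <- 1980L

# All target years per event, expanded to their prior-year windows
targets <- CJ(event = unique(toy$event), target = first_year:1987L)
win <- targets[, list(year = seq(target - window_size, target - 1L, by = 1L)),
               by = "event,target"]

# Left join: window years without an occurrence get V1 = NA
ans <- merge(win, toy, by = c("event", "year"), all.x = TRUE)

flag <- ans[, {
  n_years  <- sum(!is.na(V1))              # years with an occurrence
  complete <- min(year) >= first_year      # window fully inside the data?
  met <- if (complete) as.integer(n_years >= min_years) else NA_integer_
  list(n_years = n_years, met = met)
}, by = "event,target"]
print(flag)
```

Counting with `sum(!is.na(V1))` instead of filtering and using `.N` keeps target years with no occurrences at all in the result (with a count of 0) rather than dropping them.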