I have a simple dataframe containing three columns: An id, a date and a value. Now, I want to calculate a new value, newValue, based on these three columns following this procedure:
- For each row (i.e., for each pair of (id, date))
- For all dates in the range (date, date+2) I want to find the cumulative product of the values of that id (and then subtract 1)
The simple example below with made-up numbers does the computation:
df <- data.frame("id"=rep(1:10, 5),
"date"=c(rep(2000, 10), rep(2001, 10), rep(2002, 10), rep(2003, 10), rep(2004, 10)),
"value"=c(rep(1, 10), rep(2, 10), rep(3, 10), rep(4, 10), rep(5, 10)))
df$newValue <- 1 #initialize
for(idx in 1:dim(df)[1]) {
id <- df[idx, "id"]
lower <- df[idx, "date"]
upper <- lower + 3
df[idx, "newValue"] <- prod(df[(df$id == id) & (df$date >= lower) & (df$date < upper), ]$value + 1) - 1
}
This gives me the output (I have annotated it for simplicity):
id date value newValue
1 1 2000 1 23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
2 2 2000 1 23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
....
12 2 2001 2 59 (= (2+1) * (3+1) * (4+1) - 1 = 59)
....
22 2 2002 3 119 (= (3+1) * (4+1) * (5+1) - 1 = 119)
....
However, my final dataframe has +1million rows, so the code above is very time-consuming and inefficient.
Is there a way to speed it up, perhaps using a data.table? Note that each id may have a different number of rows, so that I why I explicitly subset.