data.table+r: cumulative mean on a column after sub setting using two other dates columns

Question

I have a data.frame DF with 4 columns given below.

DF <- structure(list(Ticker = c("ABC", "ABC", "ABC", "ABC","ABC","ABC","ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC","ABC","XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ","XYZ", "XYZ", "XYZ", "XYZ"), `Enter Date` = c("2005-02-08", "2005-02-23","2005-06-07", "2005-06-08", "2005-08-16", "2005-09-07", "2005-11-15","2005-11-17", "2005-12-06", "2005-12-23", "2006-02-09", "2006-02-10","2006-02-15", "2006-02-22", "2006-05-01", "2005-02-22", "2005-02-28","2005-03-01", "2005-03-03", "2005-03-04", "2005-03-11", "2005-03-15","2005-04-04", "2005-04-05", "2005-04-15", "2005-04-22", "2005-04-28"), `Exit Date` = c("2005-03-09", "2005-03-23", "2005-07-06","2005-07-07", "2005-09-14", "2005-10-05", "2005-12-14", "2005-12-16","2006-01-05", "2006-01-25", "2006-03-10", "2006-03-13", "2006-03-16","2006-03-22", "2006-05-30", "2005-03-22", "2005-03-29", "2005-03-30","2005-04-01", "2005-04-04", "2005-04-11", "2005-04-13", "2005-05-02","2005-05-03", "2005-05-13", "2005-05-20","2005-05-26"), Return = c(4.669,4.034, 3.796, -4.059, -11.168, -0.496,-3.597, 3.45, -4.428,1.914, 3.577, 4, 8.451, 5.521, 10.324, 3.104, 0.787,-3.407,-1.441, -4.157, 4.343, 2.827, 0.425, -1.37, -3.175, -11.027,8.144)), .Names = c("Ticker", "Enter Date", "Exit Date", "Return"), row.names = c(NA, 27L), class = "data.frame")

I would like to calculate cumulative mean of “Return” column, where ‘Enter Date’ > ‘Exit Date’ for unique Enter Date and for each Ticker. I can do it in data.frame way in two steps . The code I used is

calCumAve <- function(data,yvar,nSkip)
{
nrs <- seq_len(nrow(data))
CumAve <-  c(rep(NA,nSkip),sapply(nrs[nrs>nSkip],
FUN=function(t){mean(data[data$"Enter Date"[t]> data$"Exit Date", yvar])}))
return(CumAve)
}

DFOut <-  do.call(rbind,lapply(sort(unique(DF$Ticker)), FUN=function(s){
                     sd <- DF[DF$Ticker==s,]
                     sd$AvgRet <- calCumAve(data=sd,yvar="Return",nSkip=4)
                     return(sd)}))

The required output is DFOut.

I would like to do this operation in data.table way. While applying in data.table, the main problem I’m facing is to sub setting of Return column using two date columns. Few things to consider:

(1) In practice there will be 1000 tickers (only 2 in this example, ABC and XYZ) and more than 10 years of daily data.

(2) Do the operation without specifying nSkip. For Enter Date <= Exit Date it should give NA (not NaN as in row 20:22 in DFOut)

(3) If possible, use column names in sub setting data.table. The given example has four columns but the working data.table will have more than 25 columns, and I need to apply same calculation on several columns by changing yvar.

Any help is much appreciated. Thanks in advance.

Fyi, in your base R code, the standard way is lapply(split(DF, DF$Ticker), ...) — Frank

eddi eddi · Accepted Answer · 2016-02-17T17:47:22

dt = as.data.table(DF) # or setDT to convert in place

# cumulative mean, but without the date restriction
dt[, rawAvgRets := cumsum(Return) / (1:.N), by = Ticker]

# find the latest matching date using a rolling merge (assumes sorted dates)
# if you run into > vs >= issues, adjust enter or exit date by a day
dt[, avgRets := dt[dt, rawAvgRets, roll = TRUE,
                   on = c('Ticker' = 'Ticker', 'Exit Date' = 'Enter Date')]]
#    Ticker Enter Date  Exit Date  Return rawAvgRets    avgRets
# 1:    ABC 2005-02-08 2005-03-09   4.669  4.6690000         NA
# 2:    ABC 2005-02-23 2005-03-23   4.034  4.3515000         NA
# 3:    ABC 2005-06-07 2005-07-06   3.796  4.1663333  4.3515000
# 4:    ABC 2005-06-08 2005-07-07  -4.059  2.1100000  4.3515000
# 5:    ABC 2005-08-16 2005-09-14 -11.168 -0.5456000  2.1100000
# 6:    ABC 2005-09-07 2005-10-05  -0.496 -0.5373333  2.1100000
# 7:    ABC 2005-11-15 2005-12-14  -3.597 -0.9744286 -0.5373333
# 8:    ABC 2005-11-17 2005-12-16   3.450 -0.4213750 -0.5373333
# 9:    ABC 2005-12-06 2006-01-05  -4.428 -0.8665556 -0.5373333
#10:    ABC 2005-12-23 2006-01-25   1.914 -0.5885000 -0.4213750
#11:    ABC 2006-02-09 2006-03-10   3.577 -0.2098182 -0.5885000
#12:    ABC 2006-02-10 2006-03-13   4.000  0.1410000 -0.5885000
#13:    ABC 2006-02-15 2006-03-16   8.451  0.7802308 -0.5885000
#14:    ABC 2006-02-22 2006-03-22   5.521  1.1188571 -0.5885000
#15:    ABC 2006-05-01 2006-05-30  10.324  1.7325333  1.1188571
#16:    XYZ 2005-02-22 2005-03-22   3.104  3.1040000         NA
#17:    XYZ 2005-02-28 2005-03-29   0.787  1.9455000         NA
#18:    XYZ 2005-03-01 2005-03-30  -3.407  0.1613333         NA
#19:    XYZ 2005-03-03 2005-04-01  -1.441 -0.2392500         NA
#20:    XYZ 2005-03-04 2005-04-04  -4.157 -1.0228000         NA
#21:    XYZ 2005-03-11 2005-04-11   4.343 -0.1285000         NA
#22:    XYZ 2005-03-15 2005-04-13   2.827  0.2937143         NA
#23:    XYZ 2005-04-04 2005-05-02   0.425  0.3101250 -1.0228000
#24:    XYZ 2005-04-05 2005-05-03  -1.370  0.1234444 -1.0228000
#25:    XYZ 2005-04-15 2005-05-13  -3.175 -0.2064000  0.2937143
#26:    XYZ 2005-04-22 2005-05-20 -11.027 -1.1900909  0.2937143
#27:    XYZ 2005-04-28 2005-05-26   8.144 -0.4122500  0.2937143
#    Ticker Enter Date  Exit Date  Return rawAvgRets    avgRets

data.table+r: cumulative mean on a column after sub setting using two other dates columns

1 Answers