I have a data.frame DF with 4 columns given below.
DF <- structure(list(Ticker = c("ABC", "ABC", "ABC", "ABC","ABC","ABC","ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC","ABC","XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ","XYZ", "XYZ", "XYZ", "XYZ"), `Enter Date` = c("2005-02-08", "2005-02-23","2005-06-07", "2005-06-08", "2005-08-16", "2005-09-07", "2005-11-15","2005-11-17", "2005-12-06", "2005-12-23", "2006-02-09", "2006-02-10","2006-02-15", "2006-02-22", "2006-05-01", "2005-02-22", "2005-02-28","2005-03-01", "2005-03-03", "2005-03-04", "2005-03-11", "2005-03-15","2005-04-04", "2005-04-05", "2005-04-15", "2005-04-22", "2005-04-28"), `Exit Date` = c("2005-03-09", "2005-03-23", "2005-07-06","2005-07-07", "2005-09-14", "2005-10-05", "2005-12-14", "2005-12-16","2006-01-05", "2006-01-25", "2006-03-10", "2006-03-13", "2006-03-16","2006-03-22", "2006-05-30", "2005-03-22", "2005-03-29", "2005-03-30","2005-04-01", "2005-04-04", "2005-04-11", "2005-04-13", "2005-05-02","2005-05-03", "2005-05-13", "2005-05-20","2005-05-26"), Return = c(4.669,4.034, 3.796, -4.059, -11.168, -0.496,-3.597, 3.45, -4.428,1.914, 3.577, 4, 8.451, 5.521, 10.324, 3.104, 0.787,-3.407,-1.441, -4.157, 4.343, 2.827, 0.425, -1.37, -3.175, -11.027,8.144)), .Names = c("Ticker", "Enter Date", "Exit Date", "Return"), row.names = c(NA, 27L), class = "data.frame")
I would like to calculate cumulative mean of “Return” column, where ‘Enter Date’ > ‘Exit Date’ for unique Enter Date and for each Ticker. I can do it in data.frame way in two steps . The code I used is
calCumAve <- function(data,yvar,nSkip)
{
nrs <- seq_len(nrow(data))
CumAve <- c(rep(NA,nSkip),sapply(nrs[nrs>nSkip],
FUN=function(t){mean(data[data$"Enter Date"[t]> data$"Exit Date", yvar])}))
return(CumAve)
}
DFOut <- do.call(rbind,lapply(sort(unique(DF$Ticker)), FUN=function(s){
sd <- DF[DF$Ticker==s,]
sd$AvgRet <- calCumAve(data=sd,yvar="Return",nSkip=4)
return(sd)}))
The required output is DFOut.
I would like to do this operation in data.table way. While applying in data.table, the main problem I’m facing is to sub setting of Return column using two date columns. Few things to consider:
(1) In practice there will be 1000 tickers (only 2 in this example, ABC and XYZ) and more than 10 years of daily data.
(2) Do the operation without specifying nSkip. For Enter Date <= Exit Date it should give NA (not NaN as in row 20:22 in DFOut)
(3) If possible, use column names in sub setting data.table. The given example has four columns but the working data.table will have more than 25 columns, and I need to apply same calculation on several columns by changing yvar.
Any help is much appreciated. Thanks in advance.
lapply(split(DF, DF$Ticker), ...)- Frank