Rolling cross join with data.table using nested IDs and missing data

Question

I have a dataset of portfolio holdings:

# Input test data
portolios <- structure(list(portfolioid = c(1L, 1L, 1L, 1L, 1L, 1L), secid = c("A", "B", "A", "C", "C", "A"), reportdate = c("2010-03-31", "2010-03-31", "2010-06-30", "2010-06-30", "2010-07-15", "2010-08-31"), report_type = c("Full", "Full", "Full", "Full", "Partial", "Full"), shares = c(100L, 100L, 130L, 50L, 75L, 80L)), .Names = c("portfolioid", "secid", "reportdate", "report_type", "shares"), row.names = c(NA, -6L), class = c("data.table", "data.frame"))

 portfolioid secid reportdate report_type shares
1:           1     A 2010-03-31        Full    100
2:           1     B 2010-03-31        Full    100
3:           1     A 2010-06-30        Full    130
4:           1     C 2010-06-30        Full     50
5:           1     C 2010-07-15     Partial     75
6:           1     A 2010-08-31        Full     80

I need to impute the following missing records:

7:           1    B 2010-06-30       Full       0
8:           1    C 2010-08-31       Full       0

The business issue is that sales of positions (shares = 0) are sometimes not reported for Full report_type, so the missing SecID have to be imputed based on the prior report.

Ultimately I am seeking to calculate the change in shares for each SecID from the prior report for each portfolioID such that my dataset looks like this:

changes <- structure(list(portfolioid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), secid = c("A", "B", "A", "B", "C", "C", "A", "C"), reportdate = c("2010-03-31", "2010-03-31", "2010-06-30", "2010-06-30", "2010-06-30", "2010-07-15", "2010-08-31", "2010-08-31"), report_type = c("Full", "Full", "Full", "Full", "Full", "Partial", "Full", "Full"), shares = c(100L, 100L, 130L, 0L, 50L, 75L, 80L, 0L), change = c(100L, 100L, 30L, -100L, 50L, 25L, -50L, -75L)), .Names = c("portfolioid", "secid", "reportdate", "report_type", "shares", "change"), row.names = c(NA, -8L), class = c("data.table", "data.frame"))

   portfolioid secid reportdate report_type shares change
1:           1     A 2010-03-31        Full    100    100
2:           1     B 2010-03-31        Full    100    100
3:           1     A 2010-06-30        Full    130     30
4:           1     B 2010-06-30        Full      0   -100
5:           1     C 2010-06-30        Full     50     50
6:           1     C 2010-07-15     Partial     75     25
7:           1     A 2010-08-31        Full     80    -50
8:           1     C 2010-08-31        Full      0    -75

I am stuck with how to create the i for the outer join portfolios[i]. My issue is that I don't want to use i <- CJ(reportdate, secid) because it will produce too many records that are unnecessary since not every secid exists at every ReportDate and not correctly represent the data that needs filling.

I think I need a rolling cross-join between reportdate,reportdate[-1,secid]

I want to roll forward the secid and set shares := 0 when secid is missing in a Full report but it existed in prior report (either Partial or Full). I believe I would do this with option roll=1 but I am not sure where or how to implement.

I think my problem is similar to

How to Calculate a rolling statistic in R using data.table on unevenly spaced data

I am sure I am missing something basic in my understanding or a trick with CJ() that can create the requisite i

What are you want to join portolios to? To your older report? Where is it? You didn't provide it. Or you want to roll join to these specific two observations? It is very unclear (at least to me) what's going on. It would be easier if you'd just provide the data sets you have and then your desired output from the join — David Arenburg
@davidarenburg the only data provided is the portfolios table. I need to create or derive from this a table to join to, to implement the logic I described. The desired output is also shown as the changes table I provided. — superquant
I don't understand how do you know which are the missing records. You only specify shares := 0 but how do you know the secid and the date missing? Why 1 B 2010-08-31 Full 0 is not missing too, for example? — David Arenburg
@davidarenburg the missing records have to be imputed based on this rule: if a secid existed in a prior reportdate but is missing in the following Full reportdate then this secid should be carried forward and shares:=0. The trick is how to calculate which secid should be carried forward from the prior reportdate. This is why i believe roll=1 may be part of the solution. — superquant
@davidarenburg the reason 1 B 2010-08-31 Full 0 is not missing is because B is not in the prior report of 2010-06-30 in the input dataset. You could also look at it as only wanting to carry forward to the next period secid where shares <> 0. You do not want to carry forward filled missing values for shares = 0 — superquant

Mike.Gahan Mike.Gahan · Accepted Answer · 2014-12-14T22:37:16

Something like this should work (if I understand correctly)

First set reportdate with date class. Also get unique dates

portolios[, reportdate := as.IDate(reportdate)]
uniq.dts <- unique(portolios$reportdate)
uniq.dts <- uniq.dts[order(uniq.dts)]

Perform a self-join for each i and extract only dates occuring after we know the secid already exists (should be more memory efficient than a CJ)

setkey(portolios,secid)
setorder(portolios,sec,id,reportdate)

impute <- portolios[portolios, {
      tmp = max(reportdate) < uniq.dts;
        list(portfolioid=1,reportdate=uniq.dts[tmp][1],report_type="Full",shares=0)
},by=.EACHI][!is.na(reportdate)][,.SD[1],by=secid]

Next, rbindlist original table and the impute table.

portolios <- rbindlist(list(portolios,impute),fill=TRUE)

#Order data by secid and reportdate
portolios <- portolios[order(secid,reportdate)]

#Lag data by group
portolios[, prev.shares := c(NA,lag(shares)), by=secid]

#Calculate change WHEN a previous share amount exists
portolios[, change := ifelse(is.na(prev.shares),shares,shares-prev.shares), by=secid]

print(portolios[order(reportdate)])
   portfolioid secid reportdate report_type shares prev.shares change
1:           1     A 2010-03-31        Full    100          NA    100
2:           1     B 2010-03-31        Full    100          NA    100
3:           1     A 2010-06-30        Full    130         100     30
4:           1     B 2010-06-30        Full      0         100   -100
5:           1     C 2010-06-30        Full     50          NA     50
6:           1     C 2010-07-15     Partial     75          50     25
7:           1     A 2010-08-31        Full     80         130    -50
8:           1     C 2010-08-31        Full      0          75    -75

Rolling cross join with data.table using nested IDs and missing data

1 Answers