Grouping conditional linked values within a data.table

Question

I have a data.table with 3 input columns as follows and a fourth column representing my target output:

require(data.table)
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
                 Next_peak = c(9,15,15,23,27,27,31,39,49,49,50),
                 Valid_reversal = c(T,T,F,F,T,F,T,F,T,F,F),
                 Target_output = c(5,5,13,5,19,23,19,19,39,42,39))

I'm not sure if this is completely necessary, but I'll try to explain the dataset to hopefully make it easier to see what I'm trying to do. This is a little hard to explain in writing, so please bear with me!

The "Created" column represents the row number of a price 'peak' (i.e. reversal point) in a time series of financial data that I'm analysing. The "Next_peak" column represents the row number (in the original data set) of the next peak which exceeds the peak for that row. E.g. looking at row 1, the "Next_peak" value is 9, which is the "Created" value on row 2 of this summarised table. This means that the second peak exceeds the first peak. Conversely, in row 2, where the second peak's data is stored, the "Next_peak" value of 15 means that it isn't until the 4th peak (i.e. the one with '15' in the "Created" column) that the second peak's price level is exceeded.

Lastly, the "Valid_reversal" column denotes whether the "Created" and "Next_peak" levels are within a predefined threshold. For example, the "T" in the first row means that the peaks at rows 5 ("Created") and 9 ("Next_peak") met this criterion. If I then go to the row where "Created" is 9, there is also a "T", meaning that the "Next_peak" value of 15 also meets the criterion. However, when I go to the 4th row, where Created = 15, there is an "F", so the next peak does not meet the criterion.

What I'm trying to do is to link the 'chains' of valid reversal points and then return the original starting "Created" value. i.e. I want rows 1, 2 and 4 to have a value of '5', suggesting that the peaks for these rows were all within a predefined threshold of the original peak in row 5 of the original data-set. Conversely, row 3 should simply return 13 as there were no valid reversals at the "Next_peak" value of 15 relative to the peak formed at row 13.
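To pin down the expected output, a slow reference version of this linking could look like the sketch below. The helper name `chain_start` and the `links` table are hypothetical, and the sketch assumes each peak has at most one valid reversal pointing at it. It walks one row at a time, so it is only meant to define the target, not to scale to millions of rows.

```r
library(data.table)

Test <- data.table(Created        = c(5,9,13,15,19,23,27,31,39,42,49),
                   Next_peak      = c(9,15,15,23,27,27,31,39,49,49,50),
                   Valid_reversal = c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE),
                   Target_output  = c(5,5,13,5,19,23,19,19,39,42,39))

# Only valid reversals form links: Created -> Next_peak
links <- Test[(Valid_reversal), .(Created, Next_peak)]

# Hypothetical helper: walk backwards from one peak to the start of its chain
chain_start <- function(peak) {
  repeat {
    # Created value of the valid row that points at `peak`, or NA if none
    prev <- links$Created[match(peak, links$Next_peak)]
    if (is.na(prev)) return(peak)
    peak <- prev
  }
}

Test[, Reference := sapply(Created, chain_start)]
print(Test)
```

On this sample, `Reference` should agree with `Target_output` row by row.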

I can create the desired output with the following code. However, it's not a workable solution, as the number of steps could easily exceed 3 with my actual data sets, where more than 3 peaks can be 'linked' to the same reversal point. I could do this with a 'for' loop, but I'm wondering if there is a better way, preferably one which is as vectorised as possible, as the actual data set that I'm using contains millions of rows.

Here's my current approach:

Test[Valid_reversal == TRUE, Step0 := Next_peak]
Test[, Step1 := sapply(seq_len(.N), function(x)
    ifelse(any(!(Created[x] %in% Step0[seq_len(x)])), Created[x], NA))]
Test[, Step2 := unlist(ifelse(is.na(Step1),
    lapply(.I, function(x) Step1[which.max(Step0[seq_len(x - 1)] == Created[x])]),
    Step1))]
Test[, Step3 := unlist(ifelse(is.na(Step2),
    lapply(.I, function(x) Step2[which.max(Step0[seq_len(x - 1)] == Created[x])]),
    Step2))]

As you can see, while this data set only needs 3 iterations, the number of steps in the approach that I've taken is not definable in advance (as far as I can see). Therefore, to implement this approach, I'd have to repeat Step 2 until all values had been calculated, potentially via a 'while' loop. I'm struggling a little to work out how to do this.

Please let me know if you have any thoughts on how to address this in a more efficient way.

Thanks in advance,

Phil

Edit: Please note that I didn't mention above that the "Next_peak" values aren't necessarily monotonically increasing. In the example above this happened to be the case, which meant that nafill could be used. However, as the following example / sample output shows, nafill wouldn't give the correct output in this instance:

Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
                 Next_peak = c(27,15,15,19,23,27,42,39,42,49,50),
                 Valid_reversal = c(T,T,F,T,F,F,T,F,F,T,F),
                 Target_output = c(5,9,13,9,9,23,5,31,39,5,5))

chinsoon12 · Accepted Answer · 2020-06-26T02:09:35

I'm not sure if I understand your requirements correctly, but you can use nafill after Step 1:

#step 0 & 1
Test[, out := 
    Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
        fifelse(is.na(x.Created), i.Created, NA_integer_)]
]

#your steps 2, 3, ...
Test[Valid_reversal | is.na(out), out := nafill(out, "locf")]
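As a quick sanity check, the two steps above can be run on the first sample table and compared with Target_output. This is only a sketch: the variable names `valid` and `step1` are introduced here for illustration, and `NA_real_` is used instead of `NA_integer_` because Created is stored as a double in the sample.

```r
library(data.table)

Test <- data.table(Created        = c(5,9,13,15,19,23,27,31,39,42,49),
                   Next_peak      = c(9,15,15,23,27,27,31,39,49,49,50),
                   Valid_reversal = c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE),
                   Target_output  = c(5,5,13,5,19,23,19,19,39,42,39))

# Step 0 & 1: a row keeps its own Created unless a valid reversal points at it
valid <- Test[(Valid_reversal)]
Test[, out := valid[Test, on = .(Next_peak = Created), mult = "last",
                    fifelse(is.na(x.Created), i.Created, NA_real_)]]

# copy() so the later := by reference cannot alter this snapshot
step1 <- copy(Test$out)
print(step1)
# chain starts are filled in, linked rows are still NA:
# 5 NA 13 NA 19 23 NA NA 39 42 NA

# Steps 2, 3, ...: carry the last chain start forward over the linked rows only
Test[Valid_reversal | is.na(out), out := nafill(out, type = "locf")]

print(all.equal(Test$out, Test$Target_output))
```

The fill-forward works here because every chain in this sample is contiguous once the unlinked, invalid rows are excluded; it is not expected to hold when chains interleave.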

Edit for the new example: you can use igraph to find the chains:

#step 0 & 1
Test[, out := 
    Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
        fifelse(is.na(x.Created), i.Created, NA_integer_)]
]

#steps 2, 3, ...
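library(igraph)
# each selected row contributes an edge Created -> Next_peak
g <- graph_from_data_frame(Test[Valid_reversal | is.na(out)])
# components() is the current name for clusters(); membership maps vertex -> component id
DT <- setDT(stack(components(g)$membership), key="ind")[, 
    ind := as.numeric(levels(ind))[ind]][,
        root := min(ind), values]
# the smallest Created in each component is taken as the start of the chain
Test[Valid_reversal | is.na(out), out := DT[.SD, on=.(ind=Created), root]]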

Just for completeness, here is a while loop version:

#step 0 & 1
Test[, out :=
    Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
        fifelse(is.na(x.Created), i.Created, NA_integer_)]
]

#step 2, 3, ...
while(Test[, any(is.na(out))]) {
    Test[is.na(out), out := Test[.SD, on=.(Next_peak=Created), mult="first", x.out]]
}
Test
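Note that the loop above joins against every row of Test, so with mult="first" it relies on the valid link being the first row that shares a given Next_peak, which holds in both sample tables. A variant that only ever follows valid links could be sketched as below. The helper name `chain_starts` is hypothetical, and the sketch assumes Created is unique, chains never loop back, and each peak has at most one valid reversal pointing at it.

```r
library(data.table)

# Hypothetical helper: returns the chain start for every row of dt
chain_starts <- function(dt) {
  links <- dt[(Valid_reversal)]
  # prev[i]: Created of the valid row whose Next_peak equals Created[i], else NA
  prev <- links$Created[match(dt$Created, links$Next_peak)]
  root <- copy(dt$Created)
  repeat {
    up   <- prev[match(root, dt$Created)]  # valid predecessor of each current root
    move <- !is.na(up)
    if (!any(move)) break                  # every row has reached its chain start
    root[move] <- up[move]
  }
  root
}

Test1 <- data.table(Created        = c(5,9,13,15,19,23,27,31,39,42,49),
                    Next_peak      = c(9,15,15,23,27,27,31,39,49,49,50),
                    Valid_reversal = c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE),
                    Target_output  = c(5,5,13,5,19,23,19,19,39,42,39))

Test2 <- data.table(Created        = c(5,9,13,15,19,23,27,31,39,42,49),
                    Next_peak      = c(27,15,15,19,23,27,42,39,42,49,50),
                    Valid_reversal = c(TRUE,TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,TRUE,FALSE),
                    Target_output  = c(5,9,13,9,9,23,5,31,39,5,5))

Test1[, out := chain_starts(.SD)]
Test2[, out := chain_starts(.SD)]
print(Test2)
```

Each pass is a pair of vectorised `match` lookups, and the number of passes equals the length of the longest chain, so the cost is driven by chain depth rather than by the number of rows alone.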
