I want to select specific rows values (here TARGET) of a data.table (DT1) where the filter criterions are in an other data.table (DT2).
It is not an exact filter, because if I have value 3 in DT2 I have a minimum and a maximum variable for this value in DT1. Also I have a string which shall contain a specific pattern.
E.g : A = 3 in DT2 and the corresponding line in DT1 contains minA = 3, maxA = 6 and C = "Mon" which is in C = "Mon,Tue".
DT1
INDEX1 minA maxA C TARGET
9 : 9 3 6 Mon,Tue 109
DT2
A C INDEX2
1: 3 Mon 1
I am looking for the line with the range in which this value lies and also with the maximum target value.
I have the following simplified example:
# version 1.9.6
library(data.table)
DT1 <- data.table(INDEX1 = 1:12,
minA = c(1,1,1,2,2,2,3,3,3,4,4,4),
maxA = c(4,5,6),
C = c("Mon,Tue", "Mon,Wed", "Tue,Thu", "Wed,Thu"),
TARGET = c(101:112))
size <- 2
DT2 <- data.table(A = rep(c(3,4), size),
C = rep(c("Mon", "Thu"), size),
INDEX2 = 1:(2*size))
which looks like this:
DT1
INDEX1 minA maxA C TARGET
1 : 1 1 4 Mon,Tue 101
2 : 2 1 5 Mon,Wed 102
3 : 3 1 6 Tue,Thu 103
4 : 4 2 4 Wed,Thu 104
5 : 5 2 5 Mon,Tue 105
6 : 6 2 6 Mon,Wed 106
7 : 7 3 4 Tue,Thu 107
8 : 8 3 5 Wed,Thu 108
9 : 9 3 6 Mon,Tue 109
10: 10 4 4 Mon,Wed 110
11: 11 4 5 Tue,Thu 111
12: 12 4 6 Wed,Thu 112
DT2
A C INDEX2
1: 3 Mon 1
2: 4 Thu 2
I included size just for scaling and testing.
My solution so far is the following:
I wrote a function foo() which takes the filter-input values and returns the index (or some other more usefull variable) of DT1.
foo <- function(i.A, i.C){
DT1[INDEX1 %in% grep(i.C,C) & minA <= i.A & maxA >= i.A,][TARGET == max(TARGET),]
}
I call this function for each row of DT2
DT2[, foo(i.A = A, i.C = C), by = INDEX2]
with the outout:
INDEX2 INDEX1 minA maxA C TARGET
1: 1 9 3 6 Mon,Tue 109
2: 2 12 4 6 Wed,Thu 112
And here is my problem:
This works fine for small data.tables, but I have a lot more rows in DT2. The functions takes much longer and i was wondering if there is a better / faster way for this kind of filtering?
Maybe it is possible to "upgrade" foo() so that it can process the whole column instead of single row?
If possible i like to avoid to expand my DT1 like here:
and I think, I have a more complex filter than in these questions:
- How to filter cases in a data.table by multiple conditions defined in another data.table
- R: efficiently select specified rows from a data.table according to another data.table?
Thanks in advance for any help.