I am using R with the data.table package and would like to group a data.table by running (time) intervals, i.e. overlapping bins. For each of these running intervals I would like to count the occurrences of equal pairs of data. Furthermore, these "equal pairs of data" do not have to be exactly equal; they only have to lie within some interval range of each other.
The simple version of the question is as follows:
#Time X Y Counts
# ... ... ... 1
#I would like to do:
DT[, sum(Counts), by = list(Time, X, Y)]
#with Time, X and Y being in overlapping intervals.
findInterval()
would give me bins with "hard borders", not overlapping ones.
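To illustrate what "hard borders" means here, a small sketch using base R's `findInterval()`. The break points (every 5 time units) are an arbitrary assumption chosen only for this illustration:

```r
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)

# Arbitrary break points for illustration: [0,5), [5,10), [10,15)
breaks <- seq(0, 15, by = 5)

# Each time value is assigned to exactly one bin
bins <- findInterval(Time, breaks)
bins
```

Every value ends up in exactly one bin, so the times 4 and 5 are separated even though they differ by only 1. That is the behaviour an overlapping window around each row is meant to avoid.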
The problem in more detail: Let's say I have a data.table like this:
Time <- c(1,1,2,4,5,5,6,7,8,8,8,8,12,13)
# repeated time values are allowed
X <- c(6,6,7,10,5,7,6,3,9,10,6,3,3,6)
Y <- c(2,6,10,3,4,6,6,9,4,9,6,6,9,9)
DT <- data.table(Time, X, Y)
    Time  X  Y
 1:    1  6  2
 2:    1  6  6
 3:    2  7 10
 4:    4 10  3
 5:    5  5  4
 6:    5  7  6
 7:    6  6  6
 8:    7  3  9
 9:    8  9  4
10:    8 10  9
11:    8  6  6
12:    8  3  6
13:   12  3  9
14:   13  6  9
And some predefined interval sizes:
Timeinterval <- 5
# for a time value of 10 this means looking from 10-5 to 10+5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5
The result should give me an additional column, let's call it "count", holding the number of occurrences of equal pairs of data X and Y, considering the ranges for X and Y.
I thought about some kind of grouping by time intervals like
c(1, 1, 2, 4, 5, 5, 6) # for the first and second item: (1-5):(1+5)
c(1, 1, 2, 4, 5, 5, 6, 7) # for the third item: (2-5):(2+5)
c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8) # for the fourth item: (4-5):(4+5)
#...
c(8, 8, 8, 8, 12, 13) # for the last item: (13-5):(13+5)
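The windows listed above can be reproduced row by row with a small helper. This is only a sketch: `time_window()` and its `inclusive` argument are hypothetical names made up here, not part of data.table. Note that the lists above include the border values (`<=`), whereas the loop further down uses a strict `<`:

```r
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
Timeinterval <- 5

# Hypothetical helper: all time values within Timeinterval of t
time_window <- function(t, inclusive = TRUE) {
  d <- abs(Time - t)
  if (inclusive) Time[d <= Timeinterval] else Time[d < Timeinterval]
}

first_window  <- time_window(Time[1])                     # 1 1 2 4 5 5 6
last_window   <- time_window(Time[14])                    # 8 8 8 8 12 13
strict_window <- time_window(Time[1], inclusive = FALSE)  # 1 1 2 4 5 5
```

Each call scans the whole `Time` vector, which is exactly the per-row cost that becomes a problem on large data.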
and the following conditions for the data (but maybe there is a simpler version for that part too):
EDIT: To clarify what the result should look like:
Ranges <- DT[ , list(
X* (1 + RangeX.percentage), X* (1 - RangeX.percentage),
Y* (1 + RangeY.percentage), Y* (1 - RangeY.percentage))]
DT2 <- cbind(DT, Ranges, count = rep(1, nrow(DT)))
setnames(DT2, c("Time","X","Y","XR1","XR2","YR1","YR2","count"))
for (i in seq_len(nrow(DT2))) {
  # main part of the question: how to get this done within data.table?
  DT2.subset <- DT2[which(abs(Time - DT2[i]$Time) < Timeinterval)]
  # subsequent comparison of X and Y; the result is written to DT2$count:
  DT2$count[i] <- length(which(DT2.subset$X < DT2[i]$XR1 &
                               DT2.subset$X > DT2[i]$XR2 &
                               DT2.subset$Y < DT2[i]$YR1 &
                               DT2.subset$Y > DT2[i]$YR2))
}
DT2
    Time  X  Y  XR1 XR2  YR1 YR2 count
 1:    1  6  2  9.0 3.0  3.0 1.0     1
 2:    1  6  6  9.0 3.0  9.0 3.0     3
 3:    2  7 10 10.5 3.5 15.0 5.0     4
 4:    4 10  3 15.0 5.0  4.5 1.5     3
 5:    5  5  4  7.5 2.5  6.0 2.0     1
 6:    5  7  6 10.5 3.5  9.0 3.0     6
 7:    6  6  6  9.0 3.0  9.0 3.0     4
 8:    7  3  9  4.5 1.5 13.5 4.5     2
 9:    8  9  4 13.5 4.5  6.0 2.0     3
10:    8 10  9 15.0 5.0 13.5 4.5     4
11:    8  6  6  9.0 3.0  9.0 3.0     4
12:    8  3  6  4.5 1.5  9.0 3.0     1
13:   12  3  9  4.5 1.5 13.5 4.5     2
14:   13  6  9  9.0 3.0 13.5 4.5     1
As my complete data.table contains more than a million rows, scanning all of DT$Time for every row is far too expensive in terms of computation time.
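One direction that might avoid the explicit loop is a non-equi self-join, which data.table supports from version 1.9.8 on. The following is a minimal sketch under stated assumptions, not a definitive solution: it assumes X and Y are positive (so the lower bound is below the upper bound), it uses the same strict comparisons as the loop, and the names `bounds`, `matches`, `Tlo`, `Thi`, `Xlo`, `Xhi`, `Ylo` and `Yhi` are made up for illustration. Whether it is fast enough for a million rows would need to be checked.

```r
library(data.table)

Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
X <- c(6, 6, 7, 10, 5, 7, 6, 3, 9, 10, 6, 3, 3, 6)
Y <- c(2, 6, 10, 3, 4, 6, 6, 9, 4, 9, 6, 6, 9, 9)
DT <- data.table(Time, X, Y)

Timeinterval <- 5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5

# Per-row lower/upper bounds (column names are assumptions of this sketch)
bounds <- DT[, .(
  Tlo = Time - Timeinterval, Thi = Time + Timeinterval,
  Xlo = X * (1 - RangeX.percentage), Xhi = X * (1 + RangeX.percentage),
  Ylo = Y * (1 - RangeY.percentage), Yhi = Y * (1 + RangeY.percentage)
)]

# For each row of bounds, count the rows of DT that fall strictly inside
# the time window and inside both the X range and the Y range
matches <- DT[bounds,
              on = .(Time > Tlo, Time < Thi, X > Xlo, X < Xhi, Y > Ylo, Y < Yhi),
              .N, by = .EACHI]

# by = .EACHI returns one row per row of bounds, in the same order
DT[, count := matches$N]
DT
```

The join expresses the time window and both value ranges as `on =` conditions, so the grouping by overlapping intervals happens inside data.table instead of in an R-level loop.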