2
votes

I currently have two data.table tables. The first looks like:

    > data1
    Person       Date
    1            2007-1-1 
    2            2007-1-3
    3            2007-1-9
    4            2007-1-17
    5            2007-1-30

The second looks like:

   > data2
   Person       Date
   1            2007-1-2 
   1            2007-1-3
   1            2007-1-5
   2            2007-1-4
   2            2007-1-6
   2            2007-1-7
   2            2007-1-8
   3            2007-1-19
   4            2007-1-19
   4            2007-1-25
   5            2007-2-28
   5            2007-3-5

I would like to have:

    Person       Date            Between
    1            2007-1-1        1
    2            2007-1-3        4
    3            2007-1-9        0
    4            2007-1-17       2
    5            2007-1-30       2

Here, I would like to look at the first person in the first data.table and look at the gap in the date between the 1st person and 2nd, so that the gap is: 2007-1-1 to 2007-1-3. Then, I would like to do a search on the second by person, and return how many dates in the second data.table corresponding to person 1 is between 2007-1-1 to 2007-1-3. Here, there was only 1 instance, so we put 1 in between.

For the second case, there are four instances in the second data set with dates between 2007-1-3 and 2007-1-9, so the Between column takes on the value 4. For the last instance, Person 5, we have that 2007-2-31 and 2007-3-5 are two dates past the value for Person 5 in the first table: 2007-1-30.

I have been able to write a for-loop for this,

vector.data <- rep(NA, 5)
for(i in 1:5){
index <- which(data1$date == data2$date)
data1[index,]
}

However, my main concern is the speed. I would like to do this for a data set that is to the order of 100 millions rows. Hence, I was wondering if there was a data.table solution or some other fast solution for this. Thank you!

1
With a little bit of work in setting up date ranges properly, you can accomplish this using overlap joins that's recently implemented in 1.9.3. Here's an example. You can install it from the github project page and give it a try.Arun

1 Answers

1
votes

This seemed to do the trick, and should be relatively fast:

> dt2$MaxDate = dt1[dt2$Person + 1, "Date"]
> dt2$MinDate = dt1[dt2$Person, "Date"]
> dt2[dt2$Person == max(dt2$Person),]$MaxDate = Sys.Date() #Last person can be any time
> dt2$IsBetween = with(dt2, Date > MinDate & Date < MaxDate)

So here's what the table looks like now:

> dt2
   Person       Date    MaxDate    MinDate IsBetween
1       1 2007-01-02 2007-01-03 2007-01-01      TRUE
2       1 2007-01-03 2007-01-03 2007-01-01     FALSE
3       1 2007-01-05 2007-01-03 2007-01-01     FALSE
4       2 2007-01-04 2007-01-09 2007-01-03      TRUE
5       2 2007-01-06 2007-01-09 2007-01-03      TRUE
6       2 2007-01-07 2007-01-09 2007-01-03      TRUE
7       2 2007-01-08 2007-01-09 2007-01-03      TRUE
8       3 2007-01-19 2007-01-17 2007-01-09     FALSE
9       4 2007-01-19 2007-01-30 2007-01-17      TRUE
10      4 2007-01-25 2007-01-30 2007-01-17      TRUE
11      5 2007-02-28 2014-09-17 2007-01-30      TRUE
12      5 2007-03-05 2014-09-17 2007-01-30      TRUE

Use tapply to group results:

> dt1$Between = tapply(dt2$IsBetween, dt2$Person, sum)
> dt1
  Person       Date Between
1      1 2007-01-01       1
2      2 2007-01-03       4
3      3 2007-01-09       0
4      4 2007-01-17       2
5      5 2007-01-30       2

I used base data.frame rather than data.table because the identical column names would make scoping confusing. In this case I think performance should be fine