I currently have two data.table tables. The first looks like:
> data1
Person Date
1 2007-1-1
2 2007-1-3
3 2007-1-9
4 2007-1-17
5 2007-1-30
The second looks like:
> data2
Person Date
1 2007-1-2
1 2007-1-3
1 2007-1-5
2 2007-1-4
2 2007-1-6
2 2007-1-7
2 2007-1-8
3 2007-1-19
4 2007-1-19
4 2007-1-25
5 2007-2-28
5 2007-3-5
I would like to have:
Person Date Between
1 2007-1-1 1
2 2007-1-3 4
3 2007-1-9 0
4 2007-1-17 2
5 2007-1-30 2
Here, I would like to look at the first person in the first data.table and look at the gap in the date between the 1st person and 2nd, so that the gap is: 2007-1-1 to 2007-1-3. Then, I would like to do a search on the second by person, and return how many dates in the second data.table corresponding to person 1 is between 2007-1-1 to 2007-1-3. Here, there was only 1 instance, so we put 1 in between.
For the second case, there are four instances in the second data set with dates between 2007-1-3 and 2007-1-9, so the Between column takes on the value 4. For the last instance, Person 5, we have that 2007-2-31 and 2007-3-5 are two dates past the value for Person 5 in the first table: 2007-1-30.
I have been able to write a for-loop for this,
vector.data <- rep(NA, 5)
for(i in 1:5){
index <- which(data1$date == data2$date)
data1[index,]
}
However, my main concern is the speed. I would like to do this for a data set that is to the order of 100 millions rows. Hence, I was wondering if there was a data.table solution or some other fast solution for this. Thank you!
1.9.3
. Here's an example. You can install it from the github project page and give it a try. – Arun