Preamble: The main problem is how to subset a datatable based on IDs, forming subsets within an ID based on consecutive time differences. A hint regarding this would be most welcome.
The complete question/setup:
I have a dataset dt in data.table format that looks like
date id val1 val2
%d.%m.%Y
1 01.01.2000 1 5 10
2 09.01.2000 1 4 9
3 01.08.2000 1 3 8
4 01.01.2000 2 2 7
5 01.01.2000 3 1 6
6 14.01.2000 3 7 5
7 28.01.2000 3 8 4
8 01.06.2000 3 9 3
I want to combine observations (grouped by id) which are not more than two weeks apart (consecutively from observation to observation). By combining I mean that for each subset, I
- keep the value of the last observation of
val1 - replace
val2of the last observation with the sum of all values ofval2of the group - add
counterfor how many observations came together in this group.
I.e., I want to end up with a dataset like this
date id val1 val2 counter
%d.%m.%Y
2 09.01.2000 1 4 19 2
3 01.08.2000 1 3 8 1
4 01.01.2000 2 2 7 1
7 28.01.2000 3 8 15 3
8 01.06.2000 3 9 3 1
Still, I am trying to wrap my head around data.table functions, particularly .SD and want to solve the issue with these tools.
So far I know
- that I can indicate what I mean by first and last using
setkey(dt,date) - that I can replace the last
val2of a subset with the sumdt[, val2 := replace(val2, .N, sum(val2[-.N], na.rm = TRUE)), by=id] - that I get the length of a subset with
[.N] - how to delete rows
- that I can calculate the difference between two dates with
difftime(strptime(dt$date[1],format ="%d.%m.%Y"),strptime(dt$date[2],format ="%d.%m.%Y"),units="weeks")
However I can't get my head around how to subset the observations such that each subset contains only groups of observations of the same id with dates of (consecutive) distance at max 2 weeks.
Any help is appreciated. Many thanks in advance.