Subsetting datatable based on consecutive difference of dates

Question

Preamble: The main problem is how to subset a datatable based on IDs, forming subsets within an ID based on consecutive time differences. A hint regarding this would be most welcome.

The complete question/setup: I have a dataset dt in data.table format that looks like

  date       id val1 val2
  %d.%m.%Y
1 01.01.2000  1   5    10
2 09.01.2000  1   4     9
3 01.08.2000  1   3     8
4 01.01.2000  2   2     7
5 01.01.2000  3   1     6
6 14.01.2000  3   7     5
7 28.01.2000  3   8     4
8 01.06.2000  3   9     3

I want to combine observations (grouped by id) which are not more than two weeks apart (consecutively from observation to observation). By combining I mean that for each subset, I

keep the value of the last observation of val1
replace val2 of the last observation with the sum of all values of val2 of the group
add counter for how many observations came together in this group.

I.e., I want to end up with a dataset like this

  date       id val1 val2 counter
  %d.%m.%Y
2 09.01.2000  1   4    19       2
3 01.08.2000  1   3     8       1
4 01.01.2000  2   2     7       1
7 28.01.2000  3   8    15       3
8 01.06.2000  3   9     3       1

Still, I am trying to wrap my head around data.table functions, particularly .SD and want to solve the issue with these tools.

So far I know

that I can indicate what I mean by first and last using setkey(dt,date)
that I can replace the last val2 of a subset with the sum dt[, val2 := replace(val2, .N, sum(val2[-.N], na.rm = TRUE)), by=id]
that I get the length of a subset with [.N]
how to delete rows
that I can calculate the difference between two dates with difftime(strptime(dt$date[1],format ="%d.%m.%Y"),strptime(dt$date[2],format ="%d.%m.%Y"),units="weeks")

However I can't get my head around how to subset the observations such that each subset contains only groups of observations of the same id with dates of (consecutive) distance at max 2 weeks.

Any help is appreciated. Many thanks in advance.

user2332849 user2332849 · Accepted Answer · 2020-03-22T17:52:57

The trick is to use cumsum() on a condition. In this case, the condition is being more than 14 days. When the condition is true, the cumulative sum increments.

df %>%
  mutate(rownumber = row_number()) %>%
  group_by(id) %>%
  mutate(interval = as.numeric(as.Date(date, format = "%d.%m.%Y") - as.Date(lag(date), format = "%d.%m.%Y"))) %>%
  mutate(interval = ifelse(is.na(interval), 0, interval)) %>%
  mutate(group = cumsum(interval > 14) + 1) %>%
  ungroup() %>%
  group_by(id, group) %>%
  summarise(
    rownumber = last(rownumber),
    date = last(date),
    val1 = last(val1),
    val2 = sum(val2),
    counter = n()
  ) %>%
  select(rownumber, date, id, val1, val2, counter)

Output

  rownumber date          id  val1  val2 counter
      <int> <chr>      <int> <int> <int>   <int>
1         2 09.01.2000     1     4    19       2
2         3 01.08.2000     1     3     8       1
3         4 01.01.2000     2     2     7       1
4         7 28.01.2000     3     8    15       3
5         8 01.06.2000     3     9     3       1

Subsetting datatable based on consecutive difference of dates

1 Answers