0
votes

Preamble: The main problem is how to subset a datatable based on IDs, forming subsets within an ID based on consecutive time differences. A hint regarding this would be most welcome.

The complete question/setup: I have a dataset dt in data.table format that looks like

  date       id val1 val2
  %d.%m.%Y
1 01.01.2000  1   5    10
2 09.01.2000  1   4     9
3 01.08.2000  1   3     8
4 01.01.2000  2   2     7
5 01.01.2000  3   1     6
6 14.01.2000  3   7     5
7 28.01.2000  3   8     4
8 01.06.2000  3   9     3

I want to combine observations (grouped by id) which are not more than two weeks apart (consecutively from observation to observation). By combining I mean that for each subset, I

  • keep the value of the last observation of val1
  • replace val2 of the last observation with the sum of all values of val2 of the group
  • add counter for how many observations came together in this group.

I.e., I want to end up with a dataset like this

  date       id val1 val2 counter
  %d.%m.%Y
2 09.01.2000  1   4    19       2
3 01.08.2000  1   3     8       1
4 01.01.2000  2   2     7       1
7 28.01.2000  3   8    15       3
8 01.06.2000  3   9     3       1

Still, I am trying to wrap my head around data.table functions, particularly .SD and want to solve the issue with these tools.

So far I know

  • that I can indicate what I mean by first and last using setkey(dt,date)
  • that I can replace the last val2 of a subset with the sum dt[, val2 := replace(val2, .N, sum(val2[-.N], na.rm = TRUE)), by=id]
  • that I get the length of a subset with [.N]
  • how to delete rows
  • that I can calculate the difference between two dates with difftime(strptime(dt$date[1],format ="%d.%m.%Y"),strptime(dt$date[2],format ="%d.%m.%Y"),units="weeks")

However I can't get my head around how to subset the observations such that each subset contains only groups of observations of the same id with dates of (consecutive) distance at max 2 weeks.

Any help is appreciated. Many thanks in advance.

1

1 Answers

0
votes

The trick is to use cumsum() on a condition. In this case, the condition is being more than 14 days. When the condition is true, the cumulative sum increments.

df %>%
  mutate(rownumber = row_number()) %>%
  group_by(id) %>%
  mutate(interval = as.numeric(as.Date(date, format = "%d.%m.%Y") - as.Date(lag(date), format = "%d.%m.%Y"))) %>%
  mutate(interval = ifelse(is.na(interval), 0, interval)) %>%
  mutate(group = cumsum(interval > 14) + 1) %>%
  ungroup() %>%
  group_by(id, group) %>%
  summarise(
    rownumber = last(rownumber),
    date = last(date),
    val1 = last(val1),
    val2 = sum(val2),
    counter = n()
  ) %>%
  select(rownumber, date, id, val1, val2, counter)

Output

  rownumber date          id  val1  val2 counter
      <int> <chr>      <int> <int> <int>   <int>
1         2 09.01.2000     1     4    19       2
2         3 01.08.2000     1     3     8       1
3         4 01.01.2000     2     2     7       1
4         7 28.01.2000     3     8    15       3
5         8 01.06.2000     3     9     3       1