2
votes

Suppose that I have two tables (DT_sportA and DT_sportB) that measure time periods in which two children (id) played sport "A" and "B".

library(data.table)
library(lubridate)

DT_sportA <- data.table(id = rep(1:2,each=2),
                start_date = ymd(c("2000-01-01","2002-01-15","2014-03-12","2016-10-14")),
                end_date = ymd(c("2000-02-03","2003-03-01","2014-04-03","2017-05-19")))
DT_sportA
#    id start_date   end_date
# 1:  1 2000-01-01 2000-02-03
# 2:  1 2002-01-15 2003-03-01
# 3:  2 2014-03-12 2014-04-03
# 4:  2 2016-10-14 2017-05-19


DT_sportB <- data.table(id = c(1L,1L,2L),
                        start_date = ymd(c("2000-01-15","2002-01-15","2017-02-10")),
                        end_date = ymd(c("2000-02-01","2006-03-19","2017-02-20")))

DT_sportB
#    id start_date   end_date
# 1:  1 2000-01-15 2000-02-01
# 2:  1 2002-01-15 2006-03-19
# 3:  2 2017-02-10 2017-02-20

I would like to generate a new table with all of the unique and overlapping date ranges with two categorical indicators denoting the sport played by the children. The desired DT should look like this:

   id start_date   end_date sportA sportB
1:  1 2000-01-01 2000-01-14      1      0
2:  1 2000-01-15 2000-02-01      1      1
3:  1 2000-02-02 2000-02-03      1      0
4:  1 2002-01-15 2002-03-01      1      1
5:  1 2002-03-02 2002-03-19      0      1
6:  2 2014-03-12 2014-04-03      1      0
7:  2 2016-10-14 2017-02-09      1      0
8:  2 2017-02-10 2017-02-20      1      1
9:  2 2017-02-21 2017-05-19      1      0

This is a fairly trivial toy example. The real data spans several million rows and approximately 20 "sports", which is why I am looking for a data.table solution.

1

1 Answers

2
votes

Notes:

  • when doing similar/same things to multiple tables, I find it is almost always preferable to operate on them as a list of tables instead of individual objects; while this solution will work in general without this (some adaptation required), I believe it makes many things worth the mind-shift;

  • further, I actually think a long-format is better than a list-of-tables here, as we can still differentiate id and sport with ease;

  • your expected output is a little inconsistent in how it avoids overlap between rows; for example, "2000-01-14" is not in the data, but it is the end_date, suggesting that "2000-01-15" was reduced because the next row starts on that date ... but there is a start on "2000-02-02" for apparently similar (but reversed) reasons; one way around this is to subtract a really low number from end_date so that no id/sport/date range will match multiple rows, and I say "low number" and not 1 because Date-class objects are really numeric, and dates can be fractional: though not displayed fractionally, it is still fractional, compare Sys.Date()-0.1 with dput(Sys.Date()-0.1).

sports <- rbindlist(mget(ls(pattern = "DT_sport.*")), idcol = "sport")
sports[, sport := gsub("^DT_", "", sport) ] # primarily aesthetics
#     sport    id start_date   end_date
#    <char> <int>     <Date>     <Date>
# 1: sportA     1 2000-01-01 2000-02-03
# 2: sportA     1 2002-01-15 2003-03-01
# 3: sportA     2 2014-03-12 2014-04-03
# 4: sportA     2 2016-10-14 2017-05-19
# 5: sportB     1 2000-01-15 2000-02-01
# 6: sportB     1 2002-01-15 2006-03-19
# 7: sportB     2 2017-02-10 2017-02-20

I tend to like piping data.table, and since I'm still on R-4.0.5, I use magrittr::%>% for this; it is not strictly required, but I feel it helps readability (and therefore maintainability, etc). (I don't know if this will work as easily in R-4.1's native |> pipe, as that has more restrictions on the RHS data placement.)

library(magrittr)
out <- sports[, {
  vec <- sort(unique(c(start_date, end_date)));
  .(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
  .[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
  sports[., on = .(id, start_date <= sd, end_date >= ed) ] %>%
  .[ !is.na(sport), ] %>%
  .[, val := 1L ] %>%
  dcast(id + start_date + end_date ~ sport, value.var = "val", fill = 0)
out
#       id start_date   end_date sportA sportB
#    <int>     <Date>     <Date>  <int>  <int>
# 1:     1 2000-01-01 2000-01-14      1      0
# 2:     1 2000-01-15 2000-01-31      1      1
# 3:     1 2000-02-01 2000-02-02      1      0
# 4:     1 2002-01-15 2003-02-28      1      1
# 5:     1 2003-03-01 2006-03-19      0      1
# 6:     2 2014-03-12 2014-04-02      1      0
# 7:     2 2016-10-14 2017-02-09      1      0
# 8:     2 2017-02-10 2017-02-19      1      1
# 9:     2 2017-02-20 2017-05-19      1      0

Walk-through:

  • the first sports[, {...}] produces just the feasible date-ranges, per-id; it will produce more than needed, and these are filtered out a little later; I combine this with a slight offset to end_date so that rows are mutually exclusive (second note above); while they appear to be full-days separated, they are only separated by under 1 second; I add secdiff to show this here:

    sports[, {
      vec <- sort(unique(c(start_date, end_date)));
      .(sd = vec[-length(vec)], ed = vec[-1]);
    }, by = .(id) ] %>%
      .[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
      .[, secdiff := c(as.numeric(sd[-1] - ed[-.N], units="secs"), NA), by = .(id) ]
    #        id         sd         ed   secdiff
    #     <int>     <Date>     <Date>     <num>
    #  1:     1 2000-01-01 2000-01-14 0.8640000
    #  2:     1 2000-01-15 2000-01-31 0.8640000
    #  3:     1 2000-02-01 2000-02-02 0.8640000
    #  4:     1 2000-02-03 2002-01-14 0.8640000  # will be empty post-join
    #  5:     1 2002-01-15 2003-02-28 0.8640000
    #  6:     1 2003-03-01 2006-03-19        NA
    #  7:     2 2014-03-12 2014-04-02 0.8640001
    #  8:     2 2014-04-03 2016-10-13 0.8640001  # will be empty post-join
    #  9:     2 2016-10-14 2017-02-09 0.8640001
    # 10:     2 2017-02-10 2017-02-19 0.8640001
    # 11:     2 2017-02-20 2017-05-19        NA
    
  • btw, the first operation on sports[..] in the previous bullet is {-blockized for a slight boost in efficiency, choosing to not sort(unique(c(start_date, end_date))) twice;

  • left join sports onto this, on id and the date-ranges; this will produce NA values in the sport column, which indicates the date ranges that were programmatically made (with a simple sequence of dates) but no sports are assigned; these not-needed rows are removed by the !is.na(sport);

  • assigning val := 1L is purely so that we have a value column during reshaping;

  • dcast reshapes and fills the missing values with 0.