Notes:
when doing similar/same things to multiple tables, I find it is almost always preferable to operate on them as a list of tables instead of individual objects; while this solution will work in general without this (some adaptation required), I believe it makes many things worth the mind-shift;
further, I actually think a long-format is better than a list-of-tables here, as we can still differentiate id
and sport
with ease;
your expected output is a little inconsistent in how it avoids overlap between rows; for example, "2000-01-14"
is not in the data, but it is the end_date
, suggesting that "2000-01-15"
was reduced because the next row starts on that date ... but there is a start on "2000-02-02"
for apparently similar (but reversed) reasons; one way around this is to subtract a really low number from end_date
so that no id/sport/date range will match multiple rows, and I say "low number" and not 1
because Date
-class objects are really numeric
, and dates can be fractional: though not displayed fractionally, it is still fractional, compare Sys.Date()-0.1
with dput(Sys.Date()-0.1)
.
sports <- rbindlist(mget(ls(pattern = "DT_sport.*")), idcol = "sport")
sports[, sport := gsub("^DT_", "", sport) ] # primarily aesthetics
# sport id start_date end_date
# <char> <int> <Date> <Date>
# 1: sportA 1 2000-01-01 2000-02-03
# 2: sportA 1 2002-01-15 2003-03-01
# 3: sportA 2 2014-03-12 2014-04-03
# 4: sportA 2 2016-10-14 2017-05-19
# 5: sportB 1 2000-01-15 2000-02-01
# 6: sportB 1 2002-01-15 2006-03-19
# 7: sportB 2 2017-02-10 2017-02-20
I tend to like piping data.table
, and since I'm still on R-4.0.5, I use magrittr::%>%
for this; it is not strictly required, but I feel it helps readability (and therefore maintainability, etc). (I don't know if this will work as easily in R-4.1's native |>
pipe, as that has more restrictions on the RHS data placement.)
library(magrittr)
out <- sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
sports[., on = .(id, start_date <= sd, end_date >= ed) ] %>%
.[ !is.na(sport), ] %>%
.[, val := 1L ] %>%
dcast(id + start_date + end_date ~ sport, value.var = "val", fill = 0)
out
# id start_date end_date sportA sportB
# <int> <Date> <Date> <int> <int>
# 1: 1 2000-01-01 2000-01-14 1 0
# 2: 1 2000-01-15 2000-01-31 1 1
# 3: 1 2000-02-01 2000-02-02 1 0
# 4: 1 2002-01-15 2003-02-28 1 1
# 5: 1 2003-03-01 2006-03-19 0 1
# 6: 2 2014-03-12 2014-04-02 1 0
# 7: 2 2016-10-14 2017-02-09 1 0
# 8: 2 2017-02-10 2017-02-19 1 1
# 9: 2 2017-02-20 2017-05-19 1 0
Walk-through:
the first sports[, {...}]
produces just the feasible date-ranges, per-id
; it will produce more than needed, and these are filtered out a little later; I combine this with a slight offset to end_date
so that rows are mutually exclusive (second note above); while they appear to be full-days separated, they are only separated by under 1 second; I add secdiff
to show this here:
sports[, {
vec <- sort(unique(c(start_date, end_date)));
.(sd = vec[-length(vec)], ed = vec[-1]);
}, by = .(id) ] %>%
.[, ed := pmin(ed, shift(sd, type = "lead") - 1e-5, na.rm = TRUE), by = .(id) ] %>%
.[, secdiff := c(as.numeric(sd[-1] - ed[-.N], units="secs"), NA), by = .(id) ]
# id sd ed secdiff
# <int> <Date> <Date> <num>
# 1: 1 2000-01-01 2000-01-14 0.8640000
# 2: 1 2000-01-15 2000-01-31 0.8640000
# 3: 1 2000-02-01 2000-02-02 0.8640000
# 4: 1 2000-02-03 2002-01-14 0.8640000 # will be empty post-join
# 5: 1 2002-01-15 2003-02-28 0.8640000
# 6: 1 2003-03-01 2006-03-19 NA
# 7: 2 2014-03-12 2014-04-02 0.8640001
# 8: 2 2014-04-03 2016-10-13 0.8640001 # will be empty post-join
# 9: 2 2016-10-14 2017-02-09 0.8640001
# 10: 2 2017-02-10 2017-02-19 0.8640001
# 11: 2 2017-02-20 2017-05-19 NA
btw, the first operation on sports[..]
in the previous bullet is {
-blockized for a slight boost in efficiency, choosing to not sort(unique(c(start_date, end_date)))
twice;
left join sports
onto this, on id
and the date-ranges; this will produce NA
values in the sport
column, which indicates the date ranges that were programmatically made (with a simple sequence of dates) but no sports are assigned; these not-needed rows are removed by the !is.na(sport)
;
assigning val := 1L
is purely so that we have a value column during reshaping;
dcast
reshapes and fills the missing values with 0
.