Split date sequence into one chunk (containing start & end date) for each month

Question

Let's say I have a dataframe like the one below:

df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

 group      start        end
     a 2018-01-01 2018-02-15
     a 2018-09-01 2018-12-31
     b 2018-02-01 2018-03-30

And I would like to get the following expected output:

output <- data.frame(group = c("a", "a", "a", "a", "a", "a", "b", "b"),
                  start = as.Date(c("2018-01-01", "2018-02-01", "2018-09-01",
                                    "2018-10-01", "2018-11-01", "2018-12-01",
                                    "2018-02-01", "2018-03-01")),
                  end = as.Date(c("2018-01-31", "2018-02-15", "2018-09-30",
                                  "2018-10-31", "2018-11-30", "2018-12-31",
                                  "2018-02-28", "2018-03-30")))

 group      start        end
     a 2018-01-01 2018-01-31
     a 2018-02-01 2018-02-15
     a 2018-09-01 2018-09-30
     a 2018-10-01 2018-10-31
     a 2018-11-01 2018-11-30
     a 2018-12-01 2018-12-31
     b 2018-02-01 2018-02-28
     b 2018-03-01 2018-03-30

For each month within the sequence I would like a separate row. Its start should be the later of 1) the start date of the sequence and the first day of the month, and its end should be the earlier of 2) the end date of the sequence and the last day of the month.
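To make the rule concrete, here is a minimal base R sketch for a single interval only (the first row of df). The variable names month_start, month_end and chunks are only illustrative, and the sketch does not handle the grouped data frame.

```r
# One interval from the example data (first row of df)
start <- as.Date("2018-01-01")
end   <- as.Date("2018-02-15")

# First day of every calendar month touched by the interval
first_month <- as.Date(format(start, "%Y-%m-01"))
month_start <- seq(first_month, end, by = "1 month")

# Last day of each month = first day of the following month minus one day
month_end <- seq(first_month, by = "1 month",
                 length.out = length(month_start) + 1)[-1] - 1

# Clamp each month to the interval: later of the starts, earlier of the ends
chunks <- data.frame(start = pmax(month_start, start),
                     end   = pmin(month_end, end))
chunks
```

For this row the sketch should give two chunks, January in full and February cut off at the 15th, matching the first two rows of the expected output.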

Any ideas on how to do this?

Wimpel Wimpel · Accepted Answer · 2018-09-12T18:16:18

data.table solution

My weapon of choice for this kind of problem is data.table's very fast foverlaps() overlap join.

df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

# Lookup table with one row per calendar month of 2018 (first day, last day)
df2 <- data.frame( start = seq( as.Date("2018-01-01"), length.out = 12, by = "1 month" ),
                   end = seq( as.Date( "2018-02-01"), length.out = 12, by= "1 month" ) - 1,
                   stringsAsFactors = FALSE )

library(data.table)

# Convert both data.frames to data.tables by reference; df2 must be keyed on its interval columns for foverlaps to work.
dt <- foverlaps( setDT( df ), setDT( df2, key = c("start", "end") ), type = "any", mult = "all" )[]
# Clamp each month to the original interval (i.start/i.end): keep the later start and the earlier end
dt[ start < i.start, start := i.start ]
dt[ end > i.end, end := i.end ]
# Drop the helper columns that hold the original interval
dt[, `:=`(i.start = NULL, i.end = NULL)][]

#         start        end group
# 1: 2018-01-01 2018-01-31     a
# 2: 2018-02-01 2018-02-15     a
# 3: 2018-09-01 2018-09-30     a
# 4: 2018-10-01 2018-10-31     a
# 5: 2018-11-01 2018-11-30     a
# 6: 2018-12-01 2018-12-31     a
# 7: 2018-02-01 2018-02-28     b
# 8: 2018-03-01 2018-03-30     b
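
The month table df2 above is hard-coded to 2018. If the intervals may span other years, one option is to derive the table from the data's own date range. The following is a sketch under that assumption, not part of the original solution; the names first_month, last_month, months and res are only illustrative.

```r
library(data.table)

df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

# First day of the earliest and of the latest month present in the data
first_month <- as.Date(format(min(df$start), "%Y-%m-01"))
last_month  <- as.Date(format(max(df$end), "%Y-%m-01"))

month_start <- seq(first_month, last_month, by = "1 month")
# Last day of each month = first day of the following month minus one day
month_end <- seq(first_month, by = "1 month",
                 length.out = length(month_start) + 1)[-1] - 1

# Keyed lookup table, as foverlaps requires for its second argument
months <- data.table(start = month_start, end = month_end)
setkey(months, start, end)

# as.data.table() copies, so the original df is left untouched
res <- foverlaps(as.data.table(df), months, type = "any", mult = "all")
res[start < i.start, start := i.start]   # keep the later start
res[end > i.end, end := i.end]           # keep the earlier end
res[, c("i.start", "i.end") := NULL]
res
```

On the example data this should reproduce the eight rows shown above, since every interval falls inside the derived range of January to December 2018.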

benchmarks

Compared to @AntoniosK's tidyverse solution (which works just as well, and is more readable ;-) ), foverlaps does the job in roughly half the time on this example (median of about 5.8 ms versus 11.1 ms over 100 runs):

# Unit: milliseconds
#      expr       min       lq      mean    median        uq       max neval
# tidyverse 10.418585 10.79064 12.531207 11.080309 11.753030 93.110804   100
# foverlaps  5.320911  5.59506  5.861865  5.846766  6.009146  9.606981   100
