I have data on different persons (ID), the dates they have worked (Date), and how many hours they have worked each date (Hours).

Monday - Friday are considered working days. For each ID, I want to group consecutive working days. Weekends and holidays should be omitted when deciding if working days are consecutive.

Some examples:

If a person worked Monday, Tuesday and Wednesday, then skipped Thursday, and worked again on Friday, then Monday - Wednesday is considered one group and Friday another group.

If a person worked Thursday & Friday one week and Monday & Tuesday the next week, then these four days should be in the same group. That is, weekend days are ignored when checking whether working days are consecutive.

If a person worked Monday - Friday of one week and Monday - Friday of the next week, then I'd count those two whole weeks as consecutive working days.

I would also like to take into account common US Holidays (e.g. New Year), such that 12/31/2020 to 1/4/2021 would still count as consecutive workdays.

Once the grouping variable is created, I want to sum the working hours in each group.

Example data:

df1 = structure(list(ID = c(1, 1, 1, 1, 2, 2, 3, 3, 
3, 3), Date = structure(c(18781, 18782, 18785, 18750, 18687, 
18688, 18626, 18627, 18631, 18634), class = "Date"), Hours = c(8, 
8, 8, 16, 8, 8, 8, 8, 8, 8)), row.names = c(NA, -10L), class = "data.frame")


  ID       Date Hours
1  1 2021-06-03     8
2  1 2021-06-04     8
3  1 2021-06-07     8
4  1 2021-05-03    16
5  2 2021-03-01     8
6  2 2021-03-02     8
7  3 2020-12-30     8
8  3 2020-12-31     8
9  3 2021-01-04     8
10 3 2021-01-07     8

I imagine my output to look something like this:

 ID      Date1      Date2 Hours
1 1 2021-06-03 2021-06-07    24
# the weekend, June 5-6, is omitted
# when the group of consecutive working days is created
 
2 1 2021-05-03 2021-05-03    16
3 2 2021-03-01 2021-03-02    16

4 3 2020-12-30 2021-01-04    24
# the public holiday (Jan 1) and the weekend (Jan 2-3) are omitted
   
5 3 2021-01-07 2021-01-07     8

My top priority is to at least get the consecutive work week figured out; the holiday portion would be an added bonus.

1 Answer

You may use RQuantLib::businessDaysBetween. For each ID (by = ID), calculate the number of business days between each pair of successive rows, i.e. provide vectors of "lag" (head(Date, -1)) and "lead" (tail(Date, -1)) dates as the from and to arguments. Pick a relevant calendar (see Details in ?businessDaysBetween).

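For each ID and run of consecutive business days (by = .(ID, g = cumsum(d != 1L))), select the first and last date (from = Date[1], to = Date[.N]) and sum the Hours (sum(Hours)). Note that g is a running count over the whole table, so it only identifies a group in combination with ID.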

library(data.table)
library(RQuantLib)
setDT(df1)

df1[ , d := c(1, businessDaysBetween(calendar = "UnitedStates",
                                     from = head(Date, -1), to = tail(Date, -1))),
    by = ID]

df1[ , .(from = Date[1], to = Date[.N], Hours = sum(Hours)),
    by = .(ID, g = cumsum(d != 1L))]

#    ID g       from         to Hours
# 1:  1 0 2021-06-03 2021-06-07    24
# 2:  1 1 2021-05-03 2021-05-03    16
# 3:  2 1 2021-03-01 2021-03-02    16
# 4:  3 1 2020-12-30 2021-01-04    24
# 5:  3 2 2021-01-07 2021-01-07     8
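
If RQuantLib is not available, the weekday-only part of the same idea can be approximated in base R. This is a rough sketch under stated assumptions, not a replacement for businessDaysBetween: weekdays_between is a hypothetical helper (not from any package) that counts Monday - Friday dates in the half-open interval [from, to) and knows nothing about holidays.

```r
# Example data from the question.
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  Date = as.Date(c("2021-06-03", "2021-06-04", "2021-06-07", "2021-05-03",
                   "2021-03-01", "2021-03-02", "2020-12-30", "2020-12-31",
                   "2021-01-04", "2021-01-07")),
  Hours = c(8, 8, 8, 16, 8, 8, 8, 8, 8, 8)
)

# Hypothetical helper: number of Mon-Fri dates in [from, to).
# Holidays are ignored. Out-of-order or equal dates return 0,
# so they can never be mistaken for "one working day apart".
weekdays_between <- function(from, to) {
  vapply(seq_along(from), function(i) {
    if (to[i] <= from[i]) return(0L)
    days <- seq(from[i], to[i] - 1, by = "day")
    sum(!as.POSIXlt(days)$wday %in% c(0, 6))  # wday: 0 = Sunday, 6 = Saturday
  }, integer(1))
}

# Weekday distance to the previous row within each ID; the first row gets 1.
df$d <- ave(as.numeric(df$Date), df$ID, FUN = function(x) {
  x <- as.Date(x, origin = "1970-01-01")
  c(1, weekdays_between(head(x, -1), tail(x, -1)))
})

# A new group starts wherever the distance is not exactly one working day.
df$g <- cumsum(df$d != 1)

# First date, last date and total hours per ID and group.
res <- do.call(rbind, lapply(
  split(df, list(df$ID, df$g), drop = TRUE),
  function(s) data.frame(ID = s$ID[1], from = s$Date[1],
                         to = s$Date[nrow(s)], Hours = sum(s$Hours))
))
res
```

Because this sketch has no holiday calendar, 2020-12-31 and 2021-01-04 end up in different groups for ID 3 (Friday, Jan 1 counts as a missed working day), so the example data yields six groups instead of five.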

More convoluted solution (pre-businessDaysBetween):

Create a full sequence of dates within each ID (df1[ , .(Date = seq(min(Date), max(Date), by = "1 day")), by = ID]). Join with the original data on ID and Date (df1[..., on = .(ID, Date)]). For dates not present in the original data, i.e. gaps between (originally) consecutive days, Hours will be NA.

Within each ID (by = ID), create a run length index based on missing Hours (rleid(is.na(Hours))). For rows with missing Hours (d[is.na(Hours)]), i.e. the gaps in the original time series, for each ID and run (by = .(ID, r)), check if all dates are either a weekend (wday(Date) %in% c(1, 7)) or (|) a public holiday* (Date %in% as.Date(holidayNYSE(unique(year(Date))))), and create an index variable, ix.

For original rows and weekend/holiday gaps (!is.na(Hours) | ix), create a grouping variable of consecutive dates (g = cumsum(c(TRUE, diff(Date) != 1L))). For each ID and run of consecutive dates (by = .(ID, g)), select the first and last date (from = Date[1], to = Date[.N]) and sum the Hours (sum(Hours, na.rm = TRUE)).

library(data.table)
library(timeDate)
setDT(df1)

d = df1[df1[ , .(Date = seq(min(Date), max(Date), by = "1 day")), by = ID],
    on = .(ID, Date)]

d[ , r := rleid(is.na(Hours)), by = ID]

d[is.na(Hours), ix := all(
    wday(Date) %in% c(1, 7) |
      Date %in% as.Date(holidayNYSE(unique(year(Date)))))
, by = .(ID, r)]


d[!is.na(Hours) | ix, .(Date, Hours, g = cumsum(c(TRUE, diff(Date) != 1L))),
  by = ID][
  , .(from = Date[1], to = Date[.N],
    Hours = sum(Hours, na.rm = TRUE)),
  by = .(ID, g)] 

#    ID g       from         to Hours
# 1:  1 1 2021-05-03 2021-05-03    16
# 2:  1 2 2021-06-03 2021-06-07    24
# 3:  2 1 2021-03-01 2021-03-02    16
# 4:  3 1 2020-12-30 2021-01-04    24
# 5:  3 2 2021-01-07 2021-01-07     8

*Please see the timeDate manual for other definitions of holiday.
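
To see how the run index, the gap check and the final grouping fit together, the steps can be traced by hand for a single person in base R. This is only an illustrative sketch: the holidays vector is a hand-typed assumption standing in for holidayNYSE, and rle() is used as a stand-in for data.table's rleid().

```r
# One person, full daily sequence; NA marks days with no recorded work.
dates <- seq(as.Date("2020-12-30"), as.Date("2021-01-07"), by = "day")
hours <- c(8, 8, NA, NA, NA, 8, NA, NA, 8)
holidays <- as.Date("2021-01-01")  # assumption: hand-typed holiday list

# Run index over worked / not-worked stretches (base R stand-in for rleid()).
runs <- rle(is.na(hours))
r <- rep(seq_along(runs$lengths), runs$lengths)

# A day is "off" if it is a weekend day or a listed holiday.
off <- as.POSIXlt(dates)$wday %in% c(0, 6) | dates %in% holidays

# A gap may be bridged only if every day in its run is off.
bridge <- is.na(hours) & as.logical(ave(as.numeric(off), r, FUN = min))

# Keep worked days plus bridgeable gaps, then group consecutive dates.
keep <- !is.na(hours) | bridge
g <- cumsum(c(TRUE, diff(dates[keep]) != 1))

# Total hours per group of consecutive working days.
tapply(hours[keep], g, sum, na.rm = TRUE)
```

The Jan 1-3 gap consists only of a holiday and a weekend, so it is kept and links Dec 31 to Jan 4. The Jan 5-6 gap contains ordinary weekdays, so it is dropped and Jan 7 starts a new group.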