1
votes

Packages I'm working with:

  • dplyr
  • lubridate

Starting data set I'm working with:

Employee_ID      Start_Date      End_Date
1                2/1/2013        12/31/9999
2                5/14/2013       10/5/2017
3                9/7/2014        8/30/2017

Data set I'm looking to create from the starting data set:

Date             Active_Employee_Count
12/31/2013       2
12/31/2014       3
12/31/2015       3
12/31/2016       3
12/31/2017       1

Explanation

I'd like to be able to get the counts of active employees on December 31st of each year. I'm using the dplyr and lubridate packages to try to do this, but I'm struggling. I know of the filter, group_by, summarise, mutate, and distinct functions, but I'm not sure these are sufficient to do this task. The main challenge I'm facing is how do I somehow work into the process these December 31st dates? Do I need to create some sort of array of dates (12/31/2013, 12/31/2014, etc) and then somehow use that to do this?

2

2 Answers

1
votes

So, I changed the ID of the employees to actual names.

df = data.frame(Employee_ID = c("John", "Hugh", "Jack"),
                Start_Date = mdy(c(02012013, 05142013, 09072014)),
                End_Date = mdy(c(12319999, 10052017, 08302017)))  

Years_End2 = data.frame(Employee_ID = rep(df$Employee_ID, each = dim(Years_End)[1]),
                        Years_End = seq(from = mdy(12312013), to = mdy(12312017), by = "year"))

df %>% left_join(Years_End2, by = "Employee_ID") %>%
      mutate(Active = (End_Date > Years_End & Start_Date < Years_End)) %>%
      group_by(Years_End) %>%
      summarise(sum(Active))

  Years_End  `sum(Active)`
  <date>             <int>
1 2013-12-31             2
2 2014-12-31             3
3 2015-12-31             3
4 2016-12-31             3
5 2017-12-31             1
1
votes

You can do this pretty concisely using tidyr::expand() and the %within% and %--% operators from lubridate. %--% essentially wraps lubridate::interval(), and you can use %within% to test whether a date falls within an interval.

df <- data.frame(Employee_ID = c(1, 2, 3),
                Start_Date = mdy(c(02012013, 05142013, 09072014)),
                End_Date = mdy(c(12319999, 10052017, 08302017)))

df %>% 
  mutate(year = Start_Date) %>%
  expand(nesting(Employee_ID, Start_Date, End_Date), 
         year = seq.Date(from = ymd("2013-12-31"), to = ymd("2017-12-31"), by = "year")) %>%
  mutate(inrange = year %within% (Start_Date %--% End_Date)) %>%
  group_by(year) %>%
  summarize(active_employee_count = sum(inrange))

# A tibble: 5 x 2
  year       active_employee_count
  <date>                     <int>
1 2013-12-31                     2
2 2014-12-31                     3
3 2015-12-31                     3
4 2016-12-31                     3
5 2017-12-31                     1