2
votes

Let's say I have a data.frame consisting of industry type and starting and ending dates (e.g. for an employee).

mydf <- data.frame(industry = c("Government", "Education", "Military", "Private Sector", "Government", "Private Sector"),
                   start_date = c("2014-01-01", "2016-02-01", "2012-11-01", "2013-03-01", "2012-12-01", "2011-12-01"),
                   end_date = c("2020-12-01", "2016-10-01", "2014-01-01", "2016-10-01", "2015-10-01", "2014-09-01"))

> mydf
        industry start_date   end_date
1     Government 2014-01-01 2020-12-01
2      Education 2016-02-01 2016-10-01
3       Military 2012-11-01 2014-01-01
4 Private Sector 2013-03-01 2016-10-01
5     Government 2012-12-01 2015-10-01
6 Private Sector 2011-12-01 2014-09-01

I'd like to create a stacked ggplot bar chart in which each unique year in the start_date column is on the X axis (e.g. 2011-2016) and the y axis represents the total number of observations (the row count) represented in a given industry for that year.

I'm not sure what the right way to manipulate the data.frame to allow for this. Presumably I'd need to manipulate the data to have columns for industry year and count. But I'm not sure how to produce the year columns from a date range. Any ideas?

2

2 Answers

3
votes

Convert the date columns to Date, create the 'date' sequence from the 'start_date' to 'end_date' for each row with map2 (from purrr), unnest the list output, count the year and plot with geom_bar

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
mydf %>%
   mutate(across(c(start_date, end_date), as.Date)) %>% 
   transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>% 
   unnest(c(date)) %>% 
   count(industry, year = factor(year(date))) %>%
   ggplot(aes(x = year, y = n, fill = industry)) + 
        geom_col() +
        theme_bw()

If the plot should be separate for each 'industry'

mydf %>%
   mutate(across(c(start_date, end_date), as.Date)) %>% 
   transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>% 
   unnest(c(date)) %>% 
   count(industry, year = factor(year(date))) %>%
   ggplot(aes(x = year, y = n, fill = industry)) + 
        geom_col() + 
        facet_wrap(~ industry) +
        theme_bw()

-output

enter image description here


As @IanCampbell suggested, the by for seq can be 'year'

mydf %>%
   mutate(across(c(start_date, end_date), as.Date)) %>% 
   transmute(industry, date = map2(start_date, end_date, seq, by = 'year')) %>% 
   unnest(c(date)) %>% 
   count(industry, year = factor(year(date))) %>%
   ggplot(aes(x = year, y = n, fill = industry)) + 
        geom_col() + 
        facet_wrap(~ industry) +
        theme_bw()
2
votes

Is this what you're looking for? I would recommend using purrr::pmap to create a new data frame with one row for each year based on each row of the original data.

We can use the purrr::pmap_dfr to automatically return a single data frame bound by row.

We can use the ~with(list(...), ) trick to be able to reference columns by name.

Then we can use dplyr::count to count by combinations of columns. Then it's easy.

library(dplyr)
library(purrr)
library(lubridate)
library(ggplot)
mydf %>%
  mutate(across(c(start_date, end_date), as.Date),
         start_year = year(start_date),
         end_year = year(end_date)) %>%
  pmap_dfr(~with(list(...),data.frame(industry,
                                      year = seq(start_year, end_year)))) %>%
  count(year, industry) %>%
ggplot(aes(x = year, y = n, fill = industry)) + 
  geom_bar(stat="identity")

enter image description here