cumsum with dates in R

Question

I would like to construct a "cumulative sum", counting the number of observations in a group that have ended before the observation in question started.

I prefer answers using dplyr, but prioritize low memory overhead as this is a huge dataset.

MWE below, where the variable I'd like to create is called "prior_ended_obs".

mwe <- data.frame(group = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b"),
                    start = c("1/1/1990", "1/2/1990", "1/3/1990", "1/3/1990", "1/3/1990", "1/4/1990", "1/5/1990", "1/6/1990", "1/7/1990", "1/1/1990", "1/2/1990", "1/3/1990"),
                    end   = c("1/2/1990", "1/2/1990", "1/3/1990", "1/4/1990", "1/5/1990", "1/5/1990", "1/5/1990", "1/6/1990", "1/8/1990", "1/1/1990", "1/2/1990", "1/3/1990"),
                    prior_ended_obs = c(0, 0, 2, 2, 2, 3, 4, 7, 8, 0, 1, 2)) %>%
  mutate(start = mdy(start),
         end = mdy(end)) %>%
  group_by(group) %>%
  mutate(lag_end = lag(end),
         ones = 1,
         cumsum = cumsum(ones)-1)

Sorry, I don't understand what you're trying to do. I think I'm stuck on the "counting the number of observations in a group that have ended before this one started" part. Not sure what "this one" means, or how you're determining sequence when so many rows have overlapping dates. Would you please try to clarify? — ulfelder
As @ulfelder mentioned, it is a bit confusing. Can you please include more description — akrun
@GoGonzo can you use the mwe to explain how sum_run would work? — grapes_or_raisins

arg0naut91 arg0naut91 · Accepted Answer · 2020-02-03T21:44:36

Try:

mwe <- data.frame(group = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b"),
                  start = c("1/1/1990", "1/2/1990", "1/3/1990", "1/3/1990", "1/3/1990", "1/4/1990", "1/5/1990", "1/6/1990", "1/7/1990", "1/1/1990", "1/2/1990", "1/3/1990"),
                  end   = c("1/2/1990", "1/2/1990", "1/3/1990", "1/4/1990", "1/5/1990", "1/5/1990", "1/5/1990", "1/6/1990", "1/8/1990", "1/1/1990", "1/2/1990", "1/3/1990"),
                  prior_ended_obs = c(0, 0, 2, 2, 2, 3, 4, 7, 8, 0, 1, 2)) %>%
  mutate(start = mdy(start),
         end = mdy(end)) %>%
  group_by(group) %>%
  mutate(
    prior_ended_obs = sapply(1:n(), function(x) sum(end[1:(x-1)] < start[x], na.rm = TRUE))
  )

Output:

# A tibble: 12 x 4
# Groups:   group [2]
   group start      end        prior_ended_obs
   <fct> <date>     <date>               <int>
 1 a     1990-01-01 1990-01-02               0
 2 a     1990-01-02 1990-01-02               0
 3 a     1990-01-03 1990-01-03               2
 4 a     1990-01-03 1990-01-04               2
 5 a     1990-01-03 1990-01-05               2
 6 a     1990-01-04 1990-01-05               3
 7 a     1990-01-05 1990-01-05               4
 8 a     1990-01-06 1990-01-06               7
 9 a     1990-01-07 1990-01-08               8
10 b     1990-01-01 1990-01-01               0
11 b     1990-01-02 1990-01-02               1
12 b     1990-01-03 1990-01-03               2

cumsum with dates in R

1 Answers