I am trying to count the number of rows until a condition is met within each group of a grouped data frame. I tried to adapt the solution here, but it does not seem to work with groups.
Sample data:
grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01", "2012-02-01"), origin = "1970-01-01")
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31", "2014-01-31"), origin = "1970-01-01")
df <- data.frame(grp, fromdate, todate)
My ultimate goal is to have one row per continuous coverage period per group. To do that, I need to do the following:
1) Identify rows whose dates fall completely within a preceding row's dates (i.e., the fromdate is the same or later and the todate is the same or earlier). I would then delete these date exclaves.
2) Identify rows whose fromdate is earlier than the previous row's todate, i.e., there is overlapping coverage. I would then rewrite the first row's todate to be the latest todate in that period of continuous coverage and delete the other rows.
I have code to do 2) but am struggling with part 1).
My approach so far has been to sort by fromdate and search down todate until a larger todate is reached. This would then be the desired output:
grp fromdate todate drop
1 2010-06-01 2016-12-31 0
1 2012-02-01 2013-01-31 1
1 2013-02-01 2015-10-31 1
1 2013-02-01 2015-12-31 1
1 2015-10-01 2016-01-31 1
2 2011-02-01 2013-02-28 0
2 2011-03-01 2013-02-28 1
2 2013-04-01 2013-09-30 0
2 2013-06-01 2016-12-31 0
2 2013-10-01 2017-01-31 0
3 2012-02-01 2014-01-31 0
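One possible way to get this drop flag per group, sketched under the assumption that a row counts as an exclave whenever its todate is no later than the largest todate seen so far in its group. The names `flagged` and `prev_max` are made up for illustration. Rows with the same fromdate are sorted with the longest period first, which is also an assumption, so the row order can differ slightly from the table above while the flags stay the same. `cummax()` is not defined for Date objects, hence the `as.numeric()` conversion.

```r
library(dplyr)

# Sample data from the question
grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01",
                      "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01",
                      "2013-06-01", "2013-10-01", "2012-02-01"))
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31",
                    "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30",
                    "2016-12-31", "2017-01-31", "2014-01-31"))
df <- data.frame(grp, fromdate, todate)

flagged <- df %>%
  # Within a group: earliest start first, longest period first on ties (assumption)
  arrange(grp, fromdate, desc(todate)) %>%
  group_by(grp) %>%
  mutate(
    # Largest todate among all earlier rows of the same group (-Inf for the first row)
    prev_max = lag(cummax(as.numeric(todate)), default = -Inf),
    # A row is an exclave if it ends no later than something that started before it
    drop = as.integer(as.numeric(todate) <= prev_max)
  ) %>%
  ungroup() %>%
  select(-prev_max)

print(flagged)
```

Because `lag()` and `cummax()` run inside `group_by(grp)`, the running maximum restarts for every group, which avoids having to count rows at all.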
After applying part 2), the final df should be like this:
grp fromdate todate
1 2010-06-01 2016-12-31
2 2011-02-01 2013-02-28
2 2013-04-01 2017-01-31
3 2012-02-01 2014-01-31
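For reference, a minimal sketch of how the final table could be produced directly, assuming a new coverage period starts whenever a row's fromdate is later than every todate seen so far in its group (a fromdate equal to an earlier todate is treated as overlapping, which is my assumption). The names `merged`, `prev_max` and `period` are hypothetical. With this approach rows that sit entirely inside an earlier row simply fall into the same period, so they may not need to be dropped first.

```r
library(dplyr)

# Sample data from the question
grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01",
                      "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01",
                      "2013-06-01", "2013-10-01", "2012-02-01"))
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31",
                    "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30",
                    "2016-12-31", "2017-01-31", "2014-01-31"))
df <- data.frame(grp, fromdate, todate)

merged <- df %>%
  arrange(grp, fromdate, todate) %>%
  group_by(grp) %>%
  mutate(
    # Latest todate among all earlier rows of the group (-Inf for the first row)
    prev_max = lag(cummax(as.numeric(todate)), default = -Inf),
    # Increment the period id each time a row starts after all earlier coverage has ended
    period = cumsum(as.numeric(fromdate) > prev_max)
  ) %>%
  group_by(grp, period) %>%
  # Collapse each period to its earliest start and latest end
  summarise(fromdate = min(fromdate), todate = max(todate), .groups = "drop") %>%
  select(-period)

print(merged)
```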
This works to count the number of rows until a larger date, but only on ungrouped data:
library(dplyr)

# Note: .$todate refers to the whole data frame, so the search ignores groups
df <- df %>%
  arrange(grp, fromdate, todate) %>%
  mutate(rows_to_max = sapply(1:length(todate),
    function(x) min(which(.$todate[x:length(.$todate)] > .$todate[x])) - 1)) %>%
  ungroup()
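A grouped variant of the same count might look like the sketch below. It assumes the search should stop at the end of each group and return NA when no later row in the group has a larger todate. The helper name `rows_to_larger` and the result name `counted` are made up for illustration. Referring to the bare column `todate` inside `mutate()` after `group_by()`, instead of `.$todate`, is what limits the search to the current group.

```r
library(dplyr)

# Sample data from the question
grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01",
                      "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01",
                      "2013-06-01", "2013-10-01", "2012-02-01"))
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31",
                    "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30",
                    "2016-12-31", "2017-01-31", "2014-01-31"))
df <- data.frame(grp, fromdate, todate)

# Hypothetical helper: for each element, the number of rows to step down
# until a strictly larger value appears; NA if there is none
rows_to_larger <- function(x) {
  sapply(seq_along(x), function(i) {
    later <- which(x[i:length(x)] > x[i])
    if (length(later) == 0) NA_integer_ else later[1] - 1L
  })
}

counted <- df %>%
  arrange(grp, fromdate, todate) %>%
  group_by(grp) %>%
  # todate here is only the current group's column, not the full data frame
  mutate(rows_to_max = rows_to_larger(todate)) %>%
  ungroup()

print(counted)
```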
I would prefer to keep the solution compatible with dplyr but am open to other options.
Thanks in advance.