R count rows until a condition is reached by group

Question

I am attempting to count the number of rows until a condition is reached in a grouped data frame. I have attempted to adapt the solution here but this does not seem to work with groups.

Sample data:

grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01", "2012-02-01"), origin = "1970-01-01")
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31", "2014-01-31"), origin = "1970-01-01")
df <- data.frame(grp, fromdate, todate)

My ultimate goal is to have one line per continuous coverage period per group. To do that, I need to do the following:
1) Identify rows with dates that fall completely within a preceding row's dates (i.e. the fromdate is later and the todate is earlier). I would then delete these date exclaves.
2) Identify when the current row's fromdate is earlier than the previous row's todate, i.e. there is overlapping coverage. I would then rewrite the first row's todate to be the latest todate in that period of continuous coverage and delete the other rows.

I have code to do 2) but am struggling with part 1).

My approach so far has been to sort by fromdate and search down todate until a larger todate is reached. This would then be the desired output:

grp   fromdate     todate      drop
 1    2010-06-01   2016-12-31  0
 1    2012-02-01   2013-01-31  1
 1    2013-02-01   2015-10-31  1
 1    2013-02-01   2015-12-31  1
 1    2015-10-01   2016-01-31  1
 2    2011-02-01   2013-02-28  0
 2    2011-03-01   2013-02-28  1
 2    2013-04-01   2013-09-30  0
 2    2013-06-01   2016-12-31  0
 2    2013-10-01   2017-01-31  0
 3    2012-02-01   2014-01-31  0
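
For part 1), one possible sketch (an editorial illustration, not from the original post) keeps a running maximum of todate within each group: a row is flagged when its todate is no later than the latest todate on any earlier row of the same group. It uses base R only. `flag_contained` is a hypothetical helper name, and treating an equal todate as contained is an assumption taken from the desired output above.

```r
grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01",
                      "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01",
                      "2012-02-01"))
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31",
                    "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31",
                    "2014-01-31"))
df <- data.frame(grp, fromdate, todate)

# Sort so that, within each group, earlier start dates come first
df <- df[order(df$grp, df$fromdate, df$todate), ]

# Hypothetical helper: flag a row (1) when its end date is no later than
# the latest end date among the rows before it; the first row is never flagged
flag_contained <- function(end) {
  prev_max <- c(-Inf, head(cummax(end), -1))
  as.integer(end <= prev_max)
}

# ave() applies the helper to each group's end dates separately
df$drop <- ave(as.numeric(df$todate), df$grp, FUN = flag_contained)
```

Flagged rows can then be removed with `df[df$drop == 0, ]`. Rows sharing a fromdate are only compared in sort order, so the shorter of two such rows is not flagged unless an earlier row already covers it.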

After applying part 2), the final df should be like this:

grp   fromdate     todate    
 1    2010-06-01   2016-12-31
 2    2011-02-01   2013-02-28
 2    2013-04-01   2017-01-31
 3    2012-02-01   2014-01-31

This works to count the number of rows until a larger date, but only on ungrouped data:

library(dplyr)

df <- df %>%
  arrange(grp, fromdate, todate) %>%
  # `.` is the whole data frame, so the search runs across group boundaries
  mutate(rows_to_max = sapply(seq_along(todate),
    function(x) min(which(.$todate[x:length(.$todate)] > .$todate[x])) - 1)) %>%
  ungroup()

I would prefer to keep the solution compatible with dplyr but am open to other options.
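
A grouped variant of the count can be sketched in base R with `ave()`, which applies a function to each group's values separately. This is an illustration under assumptions rather than a drop-in fix: `rows_to_later_end` is a hypothetical helper, and returning NA when no later todate exists in the group is a choice made here (the ungrouped code gives Inf with a warning in that case).

```r
grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01",
                      "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01",
                      "2012-02-01"))
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31",
                    "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31",
                    "2014-01-31"))
df <- data.frame(grp, fromdate, todate)
df <- df[order(df$grp, df$fromdate, df$todate), ]

# Hypothetical helper: for each position, count the rows until a strictly
# later end date appears further down the same vector (NA if none does)
rows_to_later_end <- function(end) {
  n <- length(end)
  vapply(seq_len(n), function(i) {
    later <- which(end[i:n] > end[i])
    if (length(later) == 0) NA_real_ else later[1] - 1
  }, numeric(1))
}

# The search now stops at the end of each group
df$rows_to_max <- ave(as.numeric(df$todate), df$grp, FUN = rows_to_later_end)
```

Because the helper only sees one group's todate values at a time, a later date in the next group can no longer end the search.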

Thanks in advance.

By "a preceding row" do you mean any preceding row, or only the directly preceding row? — user3603486
also your answer looks nice and clear, maybe put it as an answer? — user3603486
I fixed up the dummy data code and it should work to create the third group — GenericNameNumber

Adam Hoelscher · Accepted Answer · 2016-07-20T00:41:37

This uses data.table::foverlaps to match up overlapping rows and then collapses them iteratively.

grp <- rep(1:2, each = 5)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01"), origin = "1970-01-01")
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31"), origin = "1970-01-01")
df <- data.frame(grp, fromdate, todate)

library(data.table)
setDT(df)
checklength <- 0

while (checklength != nrow(df)) {

  # remember the row count before this pass; the loop ends once a pass merges nothing
  checklength <- nrow(df)

  # use data.table's foverlaps to match up rows
  setkey(df, grp, fromdate, todate)
  df <- foverlaps(df, df, mult = 'first')

  # collapse rows that have matched
  df[, todate   := pmax(todate, i.todate)]
  df[, fromdate := pmin(fromdate, i.fromdate)]
  df[, todate   := max(todate), .(grp, fromdate)]
  df[, fromdate := min(fromdate), .(grp, todate)]
  df <- unique(df[, .(grp, fromdate, todate)])
}

I can't think of a way to escape the iterative nature of this.
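
One possible non-iterative alternative (not part of the answer above) is a single sorted pass: after ordering by grp and fromdate, a row starts a new coverage period only when its fromdate is later than every todate seen so far in its group, and a cumulative sum of those starts labels the periods. The sketch below uses base R and the question's three-group sample data. The names `prev_max`, `new_period` and `merged` are made up for illustration, and treating a fromdate equal to the running maximum todate as overlapping is an assumption.

```r
grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01",
                      "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01",
                      "2012-02-01"))
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31",
                    "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31",
                    "2014-01-31"))
df <- data.frame(grp, fromdate, todate)
df <- df[order(df$grp, df$fromdate, df$todate), ]

# Latest end date among the earlier rows of the same group (-Inf for the first row)
prev_max <- ave(as.numeric(df$todate), df$grp,
                FUN = function(e) c(-Inf, head(cummax(e), -1)))

# A row opens a new period when it starts after all earlier coverage has ended;
# the first row of every group always does, so one global counter is enough
new_period <- as.numeric(df$fromdate) > prev_max
period <- cumsum(new_period)

# Collapse each period to its earliest start and latest end
merged <- do.call(rbind, lapply(split(df, period), function(d) {
  data.frame(grp = d$grp[1], fromdate = min(d$fromdate), todate = max(d$todate))
}))
rownames(merged) <- NULL
```

On the sample data this gives four rows, matching the final table in the question. The running maximum also absorbs rows that sit entirely inside an earlier row, so the separate drop step is not needed for this route.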
