0
votes

I am attempting to count the number of rows until a condition is reached in a grouped data frame. I have attempted to adapt the solution here but this does not seem to work with groups.

Sample data:

grp <- c(rep(1:2, each = 5), 3)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01", "2012-02-01"), origin = "1970-01-01")
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31", "2014-01-31"), origin = "1970-01-01")
df <- data.frame(grp, fromdate, todate)

My ultimate goal is to have one line per continuous coverage period per group. To do that, I need to do the following: 1) Identify rows with dates that fall completely within a preceding row's dates (i.e. the fromdate is larger and the todate is smaller). I would then delete these date exclaves. 2) Identify when the current row's fromdate is less than the previous row's todate, i.e., there is overlapping coverage. I would then rewrite the first row's todate to be the latest todate in that period of continuous coverage and delete the other rows.

I have code to do 2) but am struggling with part 1).

My approach so far has been to sort by fromdate and search down todate until a larger todate is reached. This would then be the desired output:

grp   fromdate     todate      drop
 1    2010-06-01   2016-12-31  0
 1    2012-02-01   2013-01-31  1
 1    2013-02-01   2015-10-31  1
 1    2013-02-01   2015-12-31  1
 1    2015-10-01   2016-01-31  1
 2    2011-02-01   2013-02-28  0
 2    2011-03-01   2013-02-28  1
 2    2013-04-01   2013-09-30  0
 2    2013-06-01   2016-12-31  0
 2    2013-10-01   2017-01-31  0
 3    2012-02-01   2014-01-31  0

After applying part 2), the final df should be like this:

grp   fromdate     todate    
 1    2010-06-01   2016-12-31
 2    2011-02-01   2013-02-28
 2    2013-04-01   2017-01-31
 3    2012-02-01   2014-01-31

This works to count the number of rows until a larger date, but only on ungrouped data:

df <- df %>%
arrange(grp, fromdate, todate) %>%
mutate(rows_to_max = sapply(1:length(todate), 
      function(x) min(which(.$todate[x:length(.$todate)] > .$todate[x]))-1)) %>%
ungroup()

I would prefer to keep the solution compatible with dplyr but am open to other options.

Thanks in advance.

3
By "a preceding row" do you mean any preceding row, or only the directly preceding row? - user3603486
Any preceding row in that group. - GenericNameNumber
the code in the first box has stopped working - user3603486
also your answer looks nice and clear, maybe put it as an answer? - user3603486
I fixed up the dummy data code and it should work to create the third group - GenericNameNumber

3 Answers

0
votes

Using data.table::foverlap to match up rows and then collapse them iteratively.

grp <- rep(1:2, each = 5)
fromdate <- as.Date(c("2010-06-01", "2012-02-01", "2013-02-01", "2013-02-01", "2015-10-01", "2011-02-01", "2011-03-01", "2013-04-01", "2013-06-01", "2013-10-01"), origin = "1970-01-01")
todate <- as.Date(c("2016-12-31", "2013-01-31", "2015-10-31", "2015-12-31", "2016-01-31", "2013-02-28", "2013-02-28", "2013-09-30", "2016-12-31", "2017-01-31"), origin = "1970-01-01")
df <- data.frame(grp, fromdate, todate)

require(data.table)
setDT(df)
checklength <- 0

while (checklength != dim(df)[1]){

  # set our row count
  checklength <- dim(df)[1]

  # use data.table's foverlaps to match up rows
  setkey(df, grp, fromdate, todate)
  df <- foverlaps(df, df, mult = 'first')

  # collapse rows that have matched
  df[, todate   := pmax(todate, i.todate)]
  df[, fromdate := pmin(fromdate, i.fromdate)]
  df[, todate   := max(todate), .(grp, fromdate)]
  df[, fromdate := min(fromdate), .(grp, todate)]
  df <- unique(df[, .(grp, fromdate, todate)])
}

I can't think of a way to escape the iterative nature of this.

0
votes

Assuming you want to remove intervals contained in any preceding interval, lubridate is your friend:

library(lubridate)
df$int <- interval(df$fromdate, df$todate)
drop <- sapply(2:nrow(df),  function(x) {
    any(df$int[x] %within% df$int[1:(x-1)])
})
df$drop <- c(FALSE, drop) 

This doesn't yet deal with your need to do it by group. The following ought to work, but doesn't:

df %>% 
    group_by(grp) %>% 
    mutate(
      drop = c(FALSE, sapply(2:n(), function(x) any(int[x] %within% int[1:(x-1)])))
    )

Why not? I'm not sure but there's some quite hideous bug:

 tmp <- df %>% filter(grp==2)
 tmp

#    grp   fromdate     todate                            int
#  1   2 2011-02-01 2013-02-28 2010-06-01 UTC--2012-06-28 UTC
#  2   2 2011-03-01 2013-02-28 2012-02-01 UTC--2014-01-31 UTC   <<-  WTF???
#  3   2 2013-04-01 2013-09-30 2013-02-01 UTC--2013-08-02 UTC
#  4   2 2013-06-01 2016-12-31 2013-02-01 UTC--2016-09-02 UTC
#  5   2 2013-10-01 2017-01-31 2015-10-01 UTC--2019-01-31 UTC

So we'll avoid mixing intervals and grouped data frames. The penalty is some ugly multiple-square-brackets:

ivls <- interval(df$fromdate, df$todate)

df$idx <- 1:nrow(df)

df %>% 
  group_by(grp) %>% 
  mutate(
    drop = c(FALSE, sapply(2:n(), function(x) any(ivls[ idx[x] ] %within% ivls[ idx[1]:idx[x-1] ])))
  )

df

# Source: local data frame [10 x 5]
# Groups: grp [2]
# 
#      grp   fromdate     todate   idx  drop
#    <int>     <date>     <date> <int> <lgl>
# 1      1 2010-06-01 2016-12-31     1 FALSE
# 2      1 2012-02-01 2013-01-31     2  TRUE
# 3      1 2013-02-01 2015-10-31     3  TRUE
# 4      1 2013-02-01 2015-12-31     4  TRUE
# 5      1 2015-10-01 2016-01-31     5  TRUE
# 6      2 2011-02-01 2013-02-28     6 FALSE
# 7      2 2011-03-01 2013-02-28     7  TRUE
# 8      2 2013-04-01 2013-09-30     8 FALSE
# 9      2 2013-06-01 2016-12-31     9 FALSE
# 10     2 2013-10-01 2017-01-31    10 FALSE
0
votes

This is another way I attempted to solve this issue:

repeat {
  dfsize <-  nrow(df)
  df <- df%>%
    group_by(grp) %>%
    mutate(drop = ifelse((fromdate > lag(fromdate, 1) &
                            todate <= lag(todate, 1)) &
                           !is.na(lag(fromdate, 1)) &
                           !is.na(lag(todate, 1)),
                         1,
                         0
    )) %>%
    ungroup() %>%
    filter(drop == 0)
  dfsize2 <- nrow(df)
  if (dfsize2 == dfsize) {
    break
  }
}

It works efficiently on a subset of my data (at least up to ~100,000 rows and 38,000 groups). However, when I try to run it on 1.5m rows and 655,000 groups, it chugs along seemingly forever (until I abort). I end up repeating the mutate statement manually about 20 times.

Is this just a size of the data issue or is there a more efficient way to solve the problem?