Aggregate a tibble based on consecutive values in a boolean column

Question

I've got a fairly straightforward problem, but I'm struggling to find a solution that doesn't require a wall of code and complicated loops.

I've got a summary table, df, for an hourly time series dataset in which each observation belongs to a group. I want to merge some of those groups, based on a boolean column in the summary table. The boolean column, merge_with_next, indicates whether a given group should be merged with the next group (one row down). The merging effectively occurs by updating the end value and removing rows:

library(dplyr)

# Demo data
df <- tibble(
  group = 1:12,
  start = seq.POSIXt(as.POSIXct("2019-01-01 00:00"), as.POSIXct("2019-01-12 00:00"), by = "1 day"),
  end = seq.POSIXt(as.POSIXct("2019-01-01 23:59"), as.POSIXct("2019-01-12 23:59"), by = "1 day"), 
  merge_with_next = rep(c(TRUE, TRUE, FALSE), 4)
)

df
#> # A tibble: 12 x 4
#>    group start               end                 merge_with_next
#>    <int> <dttm>              <dttm>              <lgl>          
#>  1     1 2019-01-01 00:00:00 2019-01-01 23:59:00 TRUE           
#>  2     2 2019-01-02 00:00:00 2019-01-02 23:59:00 TRUE           
#>  3     3 2019-01-03 00:00:00 2019-01-03 23:59:00 FALSE          
#>  4     4 2019-01-04 00:00:00 2019-01-04 23:59:00 TRUE           
#>  5     5 2019-01-05 00:00:00 2019-01-05 23:59:00 TRUE           
#>  6     6 2019-01-06 00:00:00 2019-01-06 23:59:00 FALSE          
#>  7     7 2019-01-07 00:00:00 2019-01-07 23:59:00 TRUE           
#>  8     8 2019-01-08 00:00:00 2019-01-08 23:59:00 TRUE           
#>  9     9 2019-01-09 00:00:00 2019-01-09 23:59:00 FALSE          
#> 10    10 2019-01-10 00:00:00 2019-01-10 23:59:00 TRUE           
#> 11    11 2019-01-11 00:00:00 2019-01-11 23:59:00 TRUE           
#> 12    12 2019-01-12 00:00:00 2019-01-12 23:59:00 FALSE

# Desired result
desired <- tibble(
  group = c(1, 4, 7, 9),
  start = c("2019-01-01 00:00", "2019-01-04 00:00", "2019-01-07 00:00", "2019-01-10 00:00"),
  end = c("2019-01-03 23:59", "2019-01-06 23:59", "2019-01-09 23:59", "2019-01-12 23:59")
)

desired
#> # A tibble: 4 x 3
#>   group start            end             
#>   <dbl> <chr>            <chr>           
#> 1     1 2019-01-01 00:00 2019-01-03 23:59
#> 2     4 2019-01-04 00:00 2019-01-06 23:59
#> 3     7 2019-01-07 00:00 2019-01-09 23:59
#> 4     9 2019-01-10 00:00 2019-01-12 23:59

Created on 2019-03-22 by the reprex package (v0.2.1)

I'm looking for a short and clear solution that doesn't involve a myriad of helper tables and loops. The final value in the group column is not significant; I only care about the start and end columns of the result.

Ronak Shah · Accepted Answer · 2019-03-22T10:20:06

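We can use dplyr and build a grouping key that starts a new group right after every row where merge_with_next is FALSE, so each run of TRUE values ends up in the same group as the FALSE row that closes it. Then, for each group, we take the first value of the start column and the last value of the end column.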

library(dplyr)

df %>%
  group_by(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
  summarise(group = first(group),
            start = first(start), 
            end = last(end)) %>%
  ungroup() %>%
  select(-temp)

#  group start               end     
#  <int> <dttm>              <dttm>             
#1     1 2019-01-01 00:00:00 2019-01-03 23:59:00
#2     4 2019-01-04 00:00:00 2019-01-06 23:59:00
#3     7 2019-01-07 00:00:00 2019-01-09 23:59:00
#4    10 2019-01-10 00:00:00 2019-01-12 23:59:00
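
To see why the grouping key in the answer works, it can help to build it one step at a time. The following is only an illustrative sketch: the column names prev_flag and starts_new_run are made up for this breakdown and are not part of the original answer.

```r
library(dplyr)

# Same flag column as in the question's demo data
merge_with_next <- rep(c(TRUE, TRUE, FALSE), 4)

steps <- tibble(
  merge_with_next = merge_with_next,
  # Flag of the previous row; the first row has no predecessor, so default = TRUE
  prev_flag = lag(merge_with_next, default = TRUE),
  # TRUE where the previous row was FALSE, i.e. where a new run begins
  starts_new_run = !prev_flag,
  # Running count of run starts gives every run its own id
  temp = cumsum(starts_new_run)
)

print(steps, n = 12)
# temp is 0 0 0 1 1 1 2 2 2 3 3 3
```

Because of default = TRUE, the first row never opens a new run on its own, so the first run gets id 0. If the last row happened to have merge_with_next = TRUE, it would simply stay in the final run, since there is no following row to merge with.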
