How do I do a rolling cumsum over consecutive rows of a tibble in R

Question

I have a toy example of a tibble. What is the most efficient way to sum every two consecutive rows of y, grouped by x?

library(tibble)
l = list(x = c("a", "b", "a", "b", "a", "b"), y = c(1, 4, 3, 3, 7, 0))

df <- as_tibble(l)
df
#> # A tibble: 6 x 2
#>       x     y
#>   <chr> <dbl>
#> 1     a     1
#> 2     b     4
#> 3     a     3
#> 4     b     3
#> 5     a     7
#> 6     b     0

So the output would be something like this:

   group   sum  seq
     a      4     1
     a     10     2
     b      7     1
     b      3     2

I'd like to use the tidyverse and possibly roll_sum() from the RcppRoll package, and to write the code so that the number of consecutive rows is variable, for use on real-world data with many groups.

TIA

Psidom · Accepted Answer · 2017-09-27T01:53:16

One way to do this is to use group_by %>% do, where you can customize the data frame returned for each group inside do:

library(RcppRoll); library(tidyverse)

n = 2
df %>% 
    group_by(x) %>% 
    do(
        data.frame(
            sum = roll_sum(.$y, n), 
            seq = seq_len(length(.$y) - n + 1)
        )
    )

# A tibble: 4 x 3
# Groups:   x [2]
#      x   sum   seq
#  <chr> <dbl> <int>
#1     a     4     1
#2     a    10     2
#3     b     7     1
#4     b     3     2
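
To make the windowing explicit, here is a minimal base-R sketch of what a rolling sum of width n does within one group. It is not how RcppRoll is implemented; roll_sum_base is a hypothetical helper name used only for illustration, and it assumes y contains no NA values.

```r
# Hypothetical helper (not part of RcppRoll): rolling sum of width n
# via differences of a cumulative sum. Assumes y has no NA values.
roll_sum_base <- function(y, n) {
  if (length(y) < n) return(numeric(0))  # window longer than the data
  cs <- cumsum(c(0, y))                  # cs[i + 1] = sum of the first i values
  cs[(n + 1):length(cs)] - cs[1:(length(cs) - n)]
}

# Toy data from the question
df <- data.frame(x = c("a", "b", "a", "b", "a", "b"),
                 y = c(1, 4, 3, 3, 7, 0))

# Apply the helper to each group's y values
sums <- lapply(split(df$y, df$x), roll_sum_base, n = 2)
sums$a  # 4 10
sums$b  # 7 3
```

A group with k rows yields k - n + 1 sums, which is why seq in the answer is built with seq_len(length(.$y) - n + 1).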

Edit: The version above is not very efficient, probably because of the overhead of constructing a data frame per group and binding the data frames together on the go. Here is an improved version (still somewhat slower than data.table, but by a smaller margin):

df %>% 
    group_by(x) %>% 
    summarise(sum = list(roll_sum(y, n)), seq = list(seq_len(n() - n + 1))) %>%
    unnest(cols = c(sum, seq))  # tidyr >= 1.0 expects the columns to be named
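
As a possible alternative, and assuming dplyr 1.1.0 or later (newer than this answer), reframe() can return several rows per group directly, which avoids the list columns and unnest(). This is a sketch of that variant only; it is not part of the original answer and is not covered by the timings below.

```r
library(dplyr)     # assumes dplyr >= 1.1.0 for reframe() and .by
library(RcppRoll)

# Toy data from the question
df <- tibble(x = c("a", "b", "a", "b", "a", "b"),
             y = c(1, 4, 3, 3, 7, 0))
n <- 2  # window width

res <- df %>%
    reframe(
        sum = roll_sum(y, n),        # rolling sum within each group
        seq = seq_len(n() - n + 1),  # position of each window in its group
        .by = x
    )
res
```

Here n() is dplyr's group-size function and n is the window width, so the two do not clash.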

Timing, using @Matt's data and setup:

library(tibble)
library(dplyr)
library(tidyr)   ## Needed for unnest()
library(RcppRoll)
library(stringi) ## Only included for ability to generate random strings

## Generate data with arbitrary number of groups and rows --------------
rowCount   <- 100000
groupCount <- 10000
sumRows    <- 2L
set.seed(1)

l <- tibble(x = sample(stri_rand_strings(groupCount, 3), rowCount, replace = TRUE),
            y = sample(0:10, rowCount, replace = TRUE))

## Using dplyr and tibble -----------------------------------------------

ptm <- proc.time() ## Start the clock

dplyr_result <- l %>% 
    group_by(x) %>% 
    summarise(sum = list(roll_sum(y, sumRows)), seq = list(seq_len(n() - sumRows + 1))) %>%
    unnest(cols = c(sum, seq))


dplyr_time <- proc.time() - ptm ## Stop the clock

## Using data.table instead ----------------------------------------------

library(data.table)

ptm <- proc.time() ## Start the clock

setDT(l) ## Convert l to a data.table
dt_result <- l[,.(sum = RcppRoll::roll_sum(y, n = sumRows, fill = NA, align = "left"),
                  seq = seq_len(.N)),
               keyby = .(x)][!is.na(sum)]

data.table_time <- proc.time() - ptm

Result is:

dplyr_time
#   user  system elapsed 
#  0.688   0.003   0.689 
data.table_time
#   user  system elapsed 
#  0.422   0.009   0.430
