R equivalent of SQL SUM OVER PARTITION BY ROWS PRECEDING

Question

I'm having trouble replicating SQL window functions in R, in particular when creating rolling totals that sum over a specified number of prior months.

While the sqldf package in R allows for data manipulation, it doesn't seem to support window functions.

I have some mock data in R:

library(data.table)

set.seed(10)
data_1 <- data.table(Cust_ID = c(1,1,1,1,2,2,2,2,3,3,3,3),Month=c(4,3,2,1,4,3,2,1,4,3,2,1),
                          StatusCode=LETTERS[4:6],SalesValue=round(runif(12,50,1500)))

Cust_ID Month StatusCode SalesValue
   1     4          D        786
   1     3          E        495
   1     2          F        669
   1     1          D       1055
   2     4          E        173
   2     3          F        377
   2     2          D        448
   2     1          E        445
   3     4          F        943
   3     3          D        673
   3     2          E        995
   3     1          F        873

For each row, I would like to compute a rolling sum of that customer's (Cust_ID) SalesValue over the prior 2 months (not including the current month).

This would mean that, for each customer, rows with Months 1 & 2 should be null (given there aren't 2 preceding months), Month 3 should contain the summed SalesValue of Months 1 & 2 for that customer, and Month 4 should contain the summed SalesValue of Months 2 & 3.

In SQL, I would use syntax similar to the following:

SUM(SalesValue) OVER (PARTITION BY Cust_ID ORDER BY MONTH DESC ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES

Is there a way to achieve this in R, ideally using data.table (for efficiency)? Any guidance would be much appreciated.

PS: this is mock data. In my 'real' data customers have different data volumes, i.e. some customers have 5 months' worth of data, others have >36 months' worth of data, etc.

MKR · Accepted Answer · 2018-07-29T12:42:27

Since the OP is already using data.table, one solution is to call RcppRoll::roll_sumr within the scope of data.table:

library(data.table)
library(RcppRoll)

# Order on 'Cust_ID' and 'Month'
setkeyv(data_1,c("Cust_ID","Month"))

data_1[, Sum_prev:=shift(roll_sumr(SalesValue, n=2)), by=Cust_ID]

data_1
#     Cust_ID Month StatusCode SalesValue Sum_prev
#  1:       1     1          D       1055       NA
#  2:       1     2          F        669       NA
#  3:       1     3          E        495     1724
#  4:       1     4          D        786     1164
#  5:       2     1          E        445       NA
#  6:       2     2          D        448       NA
#  7:       2     3          F        377      893
#  8:       2     4          E        173      825
#  9:       3     1          F        873       NA
# 10:       3     2          E        995       NA
# 11:       3     3          D        673     1868
# 12:       3     4          F        943     1668

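The approach is to first compute a rolling sum of width 2 with roll_sumr (right-aligned, so each row holds the sum of itself and the row before it), and then lag that result by one row with data.table::shift, so that the current row ends up holding the sum of the previous 2 rows.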
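
As a possible variation on the same roll-then-lag idea, the sketch below swaps RcppRoll::roll_sumr for data.table's own frollsum (available from data.table 1.12.0), which avoids the extra dependency. It is not part of the original answer. The table name sales is hypothetical, and the SalesValue figures are typed in from the table in the question rather than regenerated with set.seed.

```r
library(data.table)

# Mock data copied from the question's printed table (assumption: values
# are entered by hand here instead of being regenerated with runif).
sales <- data.table(
  Cust_ID    = rep(1:3, each = 4),
  Month      = rep(4:1, times = 3),
  SalesValue = c(786, 495, 669, 1055, 173, 377, 448, 445, 943, 673, 995, 873)
)

# Sort ascending by customer and month so "previous rows" means "earlier months".
setorder(sales, Cust_ID, Month)

# frollsum(n = 2) is right-aligned by default: each row gets the sum of itself
# and the row before it. shift() then lags by one row, so the current month
# is excluded, which mirrors ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING.
sales[, Sum_prev := shift(frollsum(SalesValue, n = 2)), by = Cust_ID]

print(sales)
```

Here n controls how many prior rows are summed, and the default lag of 1 in shift is what drops the current row from the window. Like the SQL ROWS frame, this counts rows rather than calendar months, so it assumes each customer has one row per month with no gaps.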
