2
votes

I'm having trouble replicating SQL window functions in R, in particular rolling totals where I specify how many prior months to sum.

While the sqldf package in R allows for data manipulation, it doesn't seem to support window functions.

I have some mock data in R

library(data.table)

set.seed(10)
data_1 <- data.table(Cust_ID = c(1,1,1,1,2,2,2,2,3,3,3,3),Month=c(4,3,2,1,4,3,2,1,4,3,2,1),
                          StatusCode=LETTERS[4:6],SalesValue=round(runif(12,50,1500)))

Cust_ID Month StatusCode SalesValue
   1     4          D        786
   1     3          E        495
   1     2          F        669
   1     1          D       1055
   2     4          E        173
   2     3          F        377
   2     2          D        448
   2     1          E        445
   3     4          F        943
   3     3          D        673
   3     2          E        995
   3     1          F        873

For each row, I would like to create a cumulative sum of values pertaining to the customer (Cust_ID), for the prior 2 months (not including the current month).

This would mean that for each customer, rows with Months 1 & 2 should be null (given there aren't 2 preceding months), Month 3 should contain summed SalesValue of Months 1 & 2 for that customer, and Month 4 should contain summed Sales Value for Month 2 & 3.
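To make the target concrete, here is the expected result for customer 1 written out by hand in base R. This is only a sketch of the desired output, not a solution; the column name Sum_prev is a placeholder I made up.

```r
# SalesValue for Cust_ID 1, in ascending Month order (Months 1 to 4)
sales <- c(1055, 669, 495, 786)

# Expected result written out by hand: sum of the two preceding months,
# NA where fewer than two preceding months exist
expected <- c(NA, NA, sales[1] + sales[2], sales[2] + sales[3])

data.frame(Month = 1:4, SalesValue = sales, Sum_prev = expected)
```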

In SQL, I would use syntax similar to the following: SUM(SalesValue) OVER (PARTITION BY Cust_ID ORDER BY MONTH DESC ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES

Is there a way to achieve this in R, ideally using data.table (for efficiency)? Any guidance would be much appreciated.

PS: this is mock data. In my 'real' data, customers have different data volumes, i.e. some customers have 5 months' worth of data, others have more than 36 months' worth, etc.


5 Answers

5
votes

Since the OP is already using data.table, here is a solution that uses RcppRoll::roll_sumr within data.table:

library(data.table)
library(RcppRoll)

# Order on 'Cust_ID' and 'Month'
setkeyv(data_1,c("Cust_ID","Month"))

data_1[, Sum_prev:=shift(roll_sumr(SalesValue, n=2)), by=Cust_ID]

data_1
#    Cust_ID Month StatusCode SalesValue Sum_prev
# 1:       1     1          D       1055       NA
# 2:       1     2          F        669       NA
# 3:       1     3          E        495     1724
# 4:       1     4          D        786     1164
# 5:       2     1          E        445       NA
# 6:       2     2          D        448       NA
# 7:       2     3          F        377      893
# 8:       2     4          E        173      825
# 9:       3     1          F        873       NA
# 10:       3     2          E        995       NA
# 11:       3     3          D        673     1868
# 12:       3     4          F        943     1668

The approach is to first compute a right-aligned rolling sum of width 2 (the current row plus the one before it), then lag that result by one row with data.table::shift, so that each row holds the sum of the two rows preceding it.
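The two steps can be seen in isolation on customer 1's values. Note that this sketch swaps in data.table's own frollsum in place of RcppRoll::roll_sumr (an assumption: a data.table version recent enough to provide frollsum), so it is an illustration of the idea rather than the exact code above. The names x, step1 and step2 are made up for the example.

```r
library(data.table)

x <- c(1055, 669, 495, 786)   # Cust_ID 1, Months 1 to 4

# Step 1: right-aligned rolling sum of width 2 (current row + previous row)
step1 <- frollsum(x, n = 2)

# Step 2: lag by one row so each row sees the sum of the two rows before it
step2 <- shift(step1)

data.table(x, step1, step2)
```

The first row of step1 is NA because there is no previous row; shifting adds a second NA, which matches the two NAs per customer in the output above.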

3
votes

Here is a solution using dplyr

library(dplyr)
library(zoo)
as.data.frame(data_1) %>%  group_by(Cust_ID) %>% arrange(Cust_ID, Month) %>%
              mutate(Sum_prev =rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA)) 


# A tibble: 12 x 5
# Groups:   Cust_ID [3]
     Cust_ID Month StatusCode SalesValue Sum_prev
       <dbl> <dbl> <chr>           <dbl>    <dbl>
  1       1     1 D                1055       NA
  2       1     2 F                 669       NA
  3       1     3 E                 495     1724
  4       1     4 D                 786     1164
  5       2     1 E                 445       NA
  6       2     2 D                 448       NA
  7       2     3 F                 377      893
  8       2     4 E                 173      825
  9       3     1 F                 873       NA
 10       3     2 E                 995       NA
 11       3     3 D                 673     1868
 12       3     4 F                 943     1668

Using data.table:

library(data.table)
library(zoo)
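# Option 1: sort ascending first, then look back with negative offsets
#dt <- data_1[order(Cust_ID,Month)]
#dt[, Sum_prev:= rollapplyr(SalesValue, list(-(1:2)), sum, fill = NA), by=Cust_ID][]

# Option 2: without sorting first. data_1 is stored in descending Month order,
# so the positive offsets 1:2 point at the next two rows, i.e. the two earlier months.
# The trailing order() only sorts the printed result.
data_1[, Sum_prev := rollapplyr(SalesValue, list((1:2)), sum, fill = NA), by = Cust_ID][order(Cust_ID,Month)]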

      Cust_ID Month StatusCode SalesValue Sum_prev
  1:       1     1          D       1055    NA
  2:       1     2          F        669    NA
  3:       1     3          E        495  1724
  4:       1     4          D        786  1164
  5:       2     1          E        445    NA
  6:       2     2          D        448    NA
  7:       2     3          F        377   893
  8:       2     4          E        173   825
  9:       3     1          F        873    NA
 10:       3     2          E        995    NA
 11:       3     3          D        673  1868
 12:       3     4          F        943  1668
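
The sign of the offsets depends on how the rows are ordered. A minimal sketch on customer 1's values only (the names asc, desc, from_asc and from_desc are made up for the illustration):

```r
library(zoo)

asc  <- c(1055, 669, 495, 786)  # Cust_ID 1, Months 1 to 4
desc <- rev(asc)                # same values, Months 4 to 1

# Ascending data: offsets -1 and -2 point at the two earlier rows
from_asc <- rollapply(asc, list(-(1:2)), sum, fill = NA)

# Descending data: offsets +1 and +2 point at the two later rows,
# which hold the two earlier months
from_desc <- rollapply(desc, list(1:2), sum, fill = NA)

# Both give the same value for each month once put in the same order
all.equal(from_asc, rev(from_desc))
```

So positive offsets are only right when the data really is in descending Month order within each customer; if in doubt, sort ascending and use the negative offsets.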
2
votes

A data.table solution:

# sort the data first if the Month column is not ordered for any Cust_ID
data_1 <- data_1[order(Cust_ID, Month)]

# sum up the value of two previous Month for each Cust_ID
data_1[, rsum :=  shift(SalesValue, 1) + shift(SalesValue, 2), by = Cust_ID]

#     Cust_ID Month StatusCode SalesValue rsum
#  1:       1     1          D       1055   NA
#  2:       1     2          F        669   NA
#  3:       1     3          E        495 1724
#  4:       1     4          D        786 1164
#  5:       2     1          E        445   NA
#  6:       2     2          D        448   NA
#  7:       2     3          F        377  893
#  8:       2     4          E        173  825
#  9:       3     1          F        873   NA
# 10:       3     2          E        995   NA
# 11:       3     3          D        673 1868
# 12:       3     4          F        943 1668
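
If the number of prior months varies, the same idea can be generalised. This is a sketch only: prev_n_sum is a hypothetical helper, not part of data.table, and the data is retyped from the question in ascending Month order so the block runs on its own.

```r
library(data.table)

# Hypothetical helper: sum of the n rows preceding each row.
# lapply() is used so the result is a list even when n = 1.
prev_n_sum <- function(x, n) {
  lags <- lapply(seq_len(n), function(k) shift(x, k))
  Reduce(`+`, lags)
}

data_1 <- data.table(
  Cust_ID    = rep(1:3, each = 4),
  Month      = rep(1:4, times = 3),
  SalesValue = c(1055, 669, 495, 786, 445, 448, 377, 173, 873, 995, 673, 943)
)

# Same result as shift(SalesValue, 1) + shift(SalesValue, 2)
data_1[, rsum := prev_n_sum(SalesValue, 2), by = Cust_ID]
data_1
```

A customer with n rows or fewer simply gets NA in every row, because each row then has at least one lag that is NA.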
1
votes

1) sqldf/RPostgreSQL You can use window functions with a PostgreSQL backend, running your SQL (slightly modified so that it works) from within R like this, where data_1 is a data frame in your workspace.

library(RPostgreSQL)
library(sqldf)

sql <- 'select *, SUM("SalesValue") OVER (PARTITION BY "Cust_ID" 
                       ORDER BY "Month" DESC 
                       ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING ) as PAST_3Y_SALES 
        from "data_1"'

sqldf(sql)

giving:

   Cust_ID Month StatusCode SalesValue past_3y_sales
1        1     4          D        786            NA
2        1     3          E        495           786
3        1     2          F        669          1281
4        1     1          D       1055          1164
5        2     4          E        173            NA
6        2     3          F        377           173
7        2     2          D        448           550
8        2     1          E        445           825
9        3     4          F        943            NA
10       3     3          D        673           943
11       3     2          E        995          1616
12       3     1          F        873          1668

2) data.table/rollapply

Alternatively, use data.table with rollapply, specifying the width as offsets using list(-2:-1).

The code below is written to correspond to the SQL in the question. If you instead want two NAs for each Cust_ID rather than one, and want to sum the previous months with months in ascending order (not descending, as the question's SQL specifies), change -Month to Month in the setorder statement and remove the partial = TRUE argument from rollapply.

library(data.table)
library(zoo)

setorder(data_1, Cust_ID, -Month)
roll <- function(x) rollapply(x, list(-2:-1), sum, partial = TRUE, fill = NA)
data_1[, past_3y_sales := roll(SalesValue), by = Cust_ID]

giving:

> data_1
    Cust_ID Month StatusCode SalesValue past_3y_sales
 1:       1     4          D        786            NA
 2:       1     3          E        495           786
 3:       1     2          F        669          1281
 4:       1     1          D       1055          1164
 5:       2     4          E        173            NA
 6:       2     3          F        377           173
 7:       2     2          D        448           550
 8:       2     1          E        445           825
 9:       3     4          F        943            NA
10:       3     3          D        673           943
11:       3     2          E        995          1616
12:       3     1          F        873          1668
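
For reference, the ascending variation described above might look like the following sketch. The data is retyped from the question so the block is self-contained, and the column name past_2m_sales is one chosen here for the illustration.

```r
library(data.table)
library(zoo)

data_1 <- data.table(
  Cust_ID    = rep(1:3, each = 4),
  Month      = rep(4:1, times = 3),
  SalesValue = c(786, 495, 669, 1055, 173, 377, 448, 445, 943, 673, 995, 873)
)

# Ascending Month within each customer, and no partial = TRUE,
# so rows with fewer than two preceding months become NA
setorder(data_1, Cust_ID, Month)
roll <- function(x) rollapply(x, list(-2:-1), sum, fill = NA)
data_1[, past_2m_sales := roll(SalesValue), by = Cust_ID]
data_1
```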
0
votes

I had a similar problem, but the solutions above didn't help me. My data was data_1:

CIF_ID  LEAD_RESULT 
10000009      1         
10000009      0          
10000025      0         
10000025      0         
10000055      0        

And I needed to sum LEAD_RESULT by CIF_ID.

I did the following with data.table:

library(data.table)

dt <- as.data.table(data_1)
# add the per-CIF_ID total to every row of that group (modifies dt by reference)
dt[, group_sum := sum(LEAD_RESULT), by = CIF_ID]
dt

Result:

CIF_ID  LEAD_RESULT group_sum
10000009       1         1
10000009       0         1
10000025       0         0
10000025       0         0
10000055       0         0