1
votes

Starting with the following:

library(tidyverse)
library(lubridate)

df <- tibble(
  date = seq.Date(ymd("2018-01-01"), by = "month", length.out = 6),
  y = c(20, 10, 15, 35, 40, 50)
)

df 
#> # A tibble: 6 x 2
#>   date           y
#>   <date>     <dbl>
#> 1 2018-01-01    20
#> 2 2018-02-01    10
#> 3 2018-03-01    15
#> 4 2018-04-01    35
#> 5 2018-05-01    40
#> 6 2018-06-01    50

I would like to create a new column, z that is a recursive rolling-6-period average. That is, for 2018-07-01 this is simply the average of the last six records, but for 2018-08-01 forward, we use the (relevant) previously computed rolling average(s) in the new rolling calculation.

2018-07-01 = mean(c(20, 10, 15, 35, 40, 50)) = 28.3333
2018-08-01 = mean(c(10, 15, 35, 40, 50, 28.3333)) = 29.7222
2018-09-01 = mean(c(15, 35, 40, 50, 28.3333, 29.7222) = 33.0093
...etc...

I have tried a few things with tibbletime::rollify and zoo::rollmeanr, but neither allow me to recursively reference the last computed rolling average.

Desired Output:

desired_df <- tibble(
  date = seq.Date(ymd("2018-01-01"), by = "month", length.out = 22),
  y = c(20, 10, 15, 35, 40, 50, rep(NA, 16)),
  z = c(
    rep(NA, 6), 
    28.3333, 29.7222, 33.0093, 36.0108, 36.1793, 35.5425, 33.1329, 
    33.9328, 34.6346, 34.9055, 34.7213, 34.4783, 34.3009, 34.4955, 
    34.5893, 34.5818
  )
)
desired_df
#> # A tibble: 22 x 3
#>    date           y     z
#>    <date>     <dbl> <dbl>
#>  1 2018-01-01    20  NA  
#>  2 2018-02-01    10  NA  
#>  3 2018-03-01    15  NA  
#>  4 2018-04-01    35  NA  
#>  5 2018-05-01    40  NA  
#>  6 2018-06-01    50  NA  
#>  7 2018-07-01    NA  28.3
#>  8 2018-08-01    NA  29.7
#>  9 2018-09-01    NA  33.0
#> 10 2018-10-01    NA  36.0
#> # ... with 12 more rows
1
What is wrong with a for loop here? - duckmayr
@duckmayr it could work, but this is part of a much larger analysis in which the length of the loop may not be able to be determined up front without a lot of effort, but feel free to post your solution. - JasonAizkalns
That actually begs the question also: What's the stopping rule here? You start with a 6 row df and end with a 22 row desired_df. Why 22, and not, say, 50? - duckmayr
@duckmayr there are multiple stopping criteria, but for keeping this question manageable, we can assume 22 would cover the majority of the use cases. - JasonAizkalns

1 Answers

3
votes

We can create a function that uses a simple for-loop as a easy solution.

recursive_roll <- function(x, fn = mean, window_size = 6, ...) {
    # Use fn (mean by default) on a rolling recursive window
    # ... are arguments passed to fn
    n <- length(x)
    result <- x
    for ( i in (window_size + 1):n ) {
        result[i] <- fn(result[(i - window_size):(i - 1)], ...)
    }
    # I add in this line below to make it in line with your desired output.
    # You may choose to omit this (keep the initial values of your vector),
    # or even make this part optional.
    result[1:window_size] <- NA
    return(result)
}

Something to note about your algorithm is eventually it converges to a single number that will be repeated. I use 50 observations rather than 22 to demonstrate this:

library(dplyr)
library(lubridate)

N <- 50 # Total number of observations; I use 50 to illustrate convergence
window_size <- 6

df <- tibble(
    date = seq.Date(ymd("2018-01-01"), by = "month", length.out = N),
    y = c(20, 10, 15, 35, 40, 50, rep(NA, N - window_size))
)

desired_df <- df %>% mutate(z = recursive_roll(y))

Let's check the results:

desired_df
# A tibble: 50 x 3
   date           y     z
   <date>     <dbl> <dbl>
 1 2018-01-01    20  NA  
 2 2018-02-01    10  NA  
 3 2018-03-01    15  NA  
 4 2018-04-01    35  NA  
 5 2018-05-01    40  NA  
 6 2018-06-01    50  NA  
 7 2018-07-01    NA  28.3
 8 2018-08-01    NA  29.7
 9 2018-09-01    NA  33.0
10 2018-10-01    NA  36.0
# … with 40 more rows
tail(desired_df)
# A tibble: 6 x 3
  date           y     z
  <date>     <dbl> <dbl>
1 2021-09-01    NA  34.5
2 2021-10-01    NA  34.5
3 2021-11-01    NA  34.5
4 2021-12-01    NA  34.5
5 2022-01-01    NA  34.5
6 2022-02-01    NA  34.5

plot(desired_df$date, desired_df$z, type = "l")

enter image description here

To be more specific, the number your algorithm converges to can be derived analytically as

r <- sum(1:window_size * head(desired_df$y, window_size)) / sum(1:window_size)

After using an N = 500, we see

desired_df$z[N] == r
# [1] TRUE
sprintf("%.17f", c(desired_df$z[N], r))
# [1] "34.52380952380952550" "34.52380952380952550"

This is because you're only using window_size observations; what you may like even more is something more like an exponentially weighted moving average:

ewma <- function(x, weight = 1 / (length(x) + 1)) {
    # Gives the exponentially weighted moving average, defined as:
    # EWMA_t = weight * x_t + (1 - weight) * EWMA_{t-1}
    result <- x
    for ( i in 2:length(x) ) {
        result[i] <- weight * result[i] + (1 - weight) * result[i - 1]
    }
    return(result)
}

set.seed(123)
N <- 50
x <- c(20, 10, 15, 35, 40, 50)
df <- tibble(
    date = seq.Date(ymd("2018-01-01"), by = "month", length.out = N),
    y = c(x, sample(30:50, size = N - window_size, replace = TRUE))
)

df2 <- df %>% mutate(z = recursive_roll(y), z2 = ewma(y))
plot(df2$date, df2$y, pch = 20, col = "#80808080")
lines(df2$date, df2$z, col = "blue")
lines(df2$date, df2$z2, col = "red")

enter image description here