3
votes

I have a dataset that may contain MULTIPLE observations per date. So there could be 5 observations on date1, 2 observations on date2, and 1 observation on group3.

I want to calculate the moving average - by date - and importantly while not summarising/reducing' the number of rows. That is In this example above, I would still have 8 rows of data, and in a column next to it I would have that date's rolling average price I find this challenging because when I use a typical rolling function from ZOO package it goes line by-line and I dont know how to get it to skip by DATE

for example first step normally would be to:

df %>% 
 groupy_by(DATE) %>% 
 summarise(mean_daily_price = mean(price)) %>% 
 ungroup() %>% 
 arrange(Date) %>% 
 mutate( ra = rollapply(price, 2, mean, partial=T)   

--- but the summarise makes me lose rows.

      library(dplyr)
      library(zoo)


            DF = structure(list(Date = c("Jan-13", "Jan-13", "Jan-13", "Jan-13",  "Jan-13", "Jul-14", "Jul-14", "Oct-16"), Price = c(100L, 200L,  300L, 1000L, 400L, 150L, 50L, 600L), Average.by.Date = c(400L,  400L, 400L, 400L, 400L, 100L, 100L, 600L), Moving_Average_Size_2 = c(NA,  NA, NA, NA, NA, 250L, 250L, 350L)), .Names = c("Date", "Price", "Average.by.Date", "Moving_Average_Size_2"), class = "data.frame", row.names = c(NA, 
-8L))
2
Your question is a little unclear. What is your expected output? What do you mean by "skipping by date"?jdobres
Please make sure the variable names in your code match the variable names in your data frame and that your code doesn't have any other typos.eipi10

2 Answers

2
votes

In the code below, we use mutate instead of summarise to add mean_daily_price, so that we keep all rows of the data frame. Then, in the final mutate we run rollapply only on the unique values of mean_daily_price, but then use table and rep to repeat the output of rollapply by the number of rows for each Date.

DF %>% 
  arrange(Date) %>% 
  group_by(Date) %>% 
  mutate(mean_daily_price = mean(Price)) %>% 
  ungroup() %>% 
  mutate(ra = rep(rollapply(unique(mean_daily_price), 2, mean, fill=NA, align="right"), 
                  table(Date)[order(unique(Date))]))
    Date Price Average.by.Date Moving_Average_Size_2 mean_daily_price    ra
1 Jan-13   100             400                    NA              400    NA
2 Jan-13   200             400                    NA              400    NA
3 Jan-13   300             400                    NA              400    NA
4 Jan-13  1000             400                    NA              400    NA
5 Jan-13   400             400                    NA              400    NA
6 Jul-14   150             100                   250              100   250
7 Jul-14    50             100                   250              100   250
8 Oct-16   600             600                   350              600   350
1
votes

I think that your safest approach will need to be a two step process -- calculate the rolling averages by Date then merge them back in (still using dplyr here)

rolledAvg <-
  DF %>%
  group_by(Date) %>%
  summarise(mean_daily_price = mean(Price)) %>%
  ungroup() %>%
  arrange(Date) %>%
  mutate( ra = rollapply(mean_daily_price
                         , 2
                         , mean
                         , partial=T
                         , fill = NA))

left_join(DF, rolledAvg)

gives:

    Date Price Average.by.Date Moving_Average_Size_2 mean_daily_price  ra
1 Jan-13   100             400                    NA              400 250
2 Jan-13   200             400                    NA              400 250
3 Jan-13   300             400                    NA              400 250
4 Jan-13  1000             400                    NA              400 250
5 Jan-13   400             400                    NA              400 250
6 Jul-14   150             100                   250              100 350
7 Jul-14    50             100                   250              100 350
8 Oct-16   600             600                   350              600 600

I see in a comment to another answer that you don't think the first ra should be 250 -- if that is the case, change the calculation in your call to rollapply. Right now, it appears to be behaving as expected/documented. So, if you want something different, you will need to explain the change you want (probably a separate question).

Another caveat, particularly important for any other readers that stumble on this: this current approach treats sequential date entries as equidistant, no matter how far apart they actually are. If this works for your need, great. But, in many use cases, you may need to take care of the actual amount of time between measurements.

Similarly, the current approach loses all information about the number of measurements taken, it may be worthwhile to consider an approach that weights by number of observations (unless you are happy trusting each day's calculated average).