0
votes

I have a database with sales value for individual firms that belong to different industries. In the example dataset below:

set.seed(123)
df <- data.table(year=rep(1980:1984,each=4),sale=sample(100:150,20),ind=sample(LETTERS[1:2],20,replace = TRUE))
df[order(year,ind)]
    year sale ind
 1: 1980  114   A
 2: 1980  102   A
 3: 1980  130   B
 4: 1980  113   B
 5: 1981  136   A
 6: 1981  148   A
 7: 1981  141   B
 8: 1981  142   B
 9: 1982  124   A
10: 1982  125   A
11: 1982  104   A
12: 1982  126   B
13: 1983  108   A
14: 1983  128   A
15: 1983  140   B
16: 1983  127   B
17: 1984  134   A
18: 1984  107   A
19: 1984  106   A
20: 1984  146   B

The column "ind" represents industry and I have omitted the firm identifiers (no use in this example). I want an average defined as follows:

For each year, the desired average is the average of all firms within the industry over the past three years. If the data for past three years is not available, a minimum of two observations is also acceptable.

For example, in the above dataset, if year=1982, and ind=A, there are only two observations for past years (which is still acceptable), so the desired average is the average of all sale values in years 1980 and 1981 for industry A.

If year=1983, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1980, 1981, and 1982 for industry A.

If year=1984, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1981, 1982, and 1983 for industry A.

The desired output, thus, will be as follows:

    year sale ind   mymean
 1: 1980  130   B       NA
 2: 1980  114   A       NA
 3: 1980  113   B       NA
 4: 1980  102   A       NA
 5: 1981  141   B       NA
 6: 1981  142   B       NA
 7: 1981  136   A       NA
 8: 1981  148   A       NA
 9: 1982  124   A 125.0000
10: 1982  125   A 125.0000
11: 1982  126   B 131.5000
12: 1982  104   A 125.0000
13: 1983  140   B 130.4000
14: 1983  127   B 130.4000
15: 1983  108   A 121.8571
16: 1983  128   A 121.8571
17: 1984  134   A 124.7143
18: 1984  107   A 124.7143
19: 1984  146   B 135.2000
20: 1984  106   A 124.7143

A data.table solution is much preferred for fast implementation. Many thanks in advance.

3
Please check, both methods will give different resultsAnilGoyal

3 Answers

2
votes

I am not very good in data.table. Here is one tidyverse solution if you like or if you can translate it to data.table

library(tidyverse)

df %>% group_by(ind, year) %>% 
  summarise(ds = sum(sale),
            dn = n()) %>%
  mutate(ds = (lag(ds,1)+lag(ds,2)+ifelse(is.na(lag(ds,3)), 0, lag(ds,3)))/(lag(dn,1)+lag(dn,2)+ifelse(is.na(lag(dn,3)), 0, lag(dn,3)))
  ) %>% select(ind, year, mymean = ds) %>%
  right_join(df, by = c("ind", "year"))

`summarise()` regrouping output by 'ind' (override with `.groups` argument)
# A tibble: 20 x 4
   ind    year mymean  sale
   <chr> <int>  <dbl> <int>
 1 A      1980    NA    114
 2 A      1980    NA    102
 3 A      1981    NA    136
 4 A      1981    NA    148
 5 A      1982   125    124
 6 A      1982   125    125
 7 A      1982   125    104
 8 A      1983   122.   108
 9 A      1983   122.   128
10 A      1984   125.   134
11 A      1984   125.   107
12 A      1984   125.   106
13 B      1980    NA    130
14 B      1980    NA    113
15 B      1981    NA    141
16 B      1981    NA    142
17 B      1982   132.   126
18 B      1983   130.   140
19 B      1983   130.   127
20 B      1984   135.   146
2
votes

You can use zoo's rollapply function to perform this rolling calculation. Note that there are dedicated functions to calculate rolling mean like frollmean in data.table and rollmean in zoo but they lack the argument partial = TRUE present in rollapply. partial = TRUE is useful here since you want to calculate the mean even if the window size is less than 3.

We can first calculate mean of sale value for each ind and year, then perform rolling mean calculation with window size of 3 and join this data with the original dataframe to get all the rows of original dataframe back.

library(data.table)
library(zoo)

df1 <- df[, .(sale = mean(sale)), .(ind, year)]
df2 <- df1[, my_mean := shift(rollapplyr(sale, 3, function(x) 
  if(length(x) > 1) mean(x, na.rm = TRUE) else NA, partial = TRUE)), ind]
df[df2, on = .(ind, year)]

This can be written using dplyr as :

library(dplyr)

df %>%
  group_by(ind, year) %>%
  summarise(sale = mean(sale)) %>%
  mutate(avg_mean = lag(rollapplyr(sale, 3, partial = TRUE, function(x) 
                    if(length(x) > 1) mean(x, na.rm = TRUE) else NA))) %>% 
  left_join(df, by = c('ind', 'year'))
1
votes

Based on Ronak's answer (the mean of previous means), a more general way (the mean of all previous values), and a data.table solution then can be:

library(data.table)
library(roll)

df1 <- df[, .(sum_1 = sum(sale), n=length(sale)), .(ind, year)]
df1[,`:=`(
  my_sum = roll_sum(shift(sum_1),3,min_obs = 2),
  my_n = roll_sum(shift(n),3,min_obs = 2)
  ),by=.(ind)]
df1[,`:=`(my_mean=(my_sum/my_n))]
> df[df1[,!c("sum_1","n","my_sum","my_n")] ,on = .(ind, year)]
    year sale ind  my_mean
 1: 1980  130   B       NA
 2: 1980  113   B       NA
 3: 1980  114   A       NA
 4: 1980  102   A       NA
 5: 1981  141   B       NA
 6: 1981  142   B       NA
 7: 1981  136   A       NA
 8: 1981  148   A       NA
 9: 1982  124   A 125.0000
10: 1982  125   A 125.0000
11: 1982  104   A 125.0000
12: 1982  126   B 131.5000
13: 1983  140   B 130.4000
14: 1983  127   B 130.4000
15: 1983  108   A 121.8571
16: 1983  128   A 121.8571
17: 1984  134   A 124.7143
18: 1984  107   A 124.7143
19: 1984  106   A 124.7143
20: 1984  146   B 135.2000