Matching on a variable, according to dates, and calculating ratios

Question

I have a dataframe, lets call it df1, that looks something like this:

month            product_key          price
201408           00020e32-a64715      75
201408           00020e32-a64715      75
201408           000340b8-bacac8      20
201408           000458f1-fdb6ae      45
201408           00083ebb-e9c17f      250
201408           00207e67-15a59f      480
201408           002777d7-50bec1      12
201408           002777d7-50bec1      12
201409           00020e32-a64715      75
201409           000340b8-bacac8      20
201409           00083ebb-e9c17f      250
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201410           00083ebb-e9c17f      250
201410           00207e67-15a59f      480
201410           00207e67-15a59f      480
201410           0020baff-9730f0      39.99
201411           00083ebb-e9c17f      250
201411           00207e67-15a59f      480
201412           00083ebb-e9c17f      250
201501           00083ebb-e9c17f      200
201501           0020baff-9730f0      29.99

There are other variables in the dataset but we don't need them for this purpose. My dataset is monthly data and ranges from mid 2014 to late 2015. For each month there are hundreds of products and there can be the same product multiple times within the month.

What I want to do is identify products that appear in both say, August and September and remove the products that don't appear in both months. Then I want to calculate an average of prices, of the remaining products, for each month. Then I want to divide the average September price by the average August price. In my dataframe this calculated figure would be the September index (August is defaulted to 1 as this is where the dataset begins).

Then I would like to do the same for all the following months, so I would like to identify products that appear in both September and October, removing products that don't appear in both months, and calculate the average price (of the remaining products) for each month. Then I want to divide the average October price by the average September price (which will be different to the previously calculated September average price as there will most likely be different products that appear in both September and October, compared with products that appear in both August and September). This calculated figure would be the October index. So I want to do this for all of the following months (October & November, November & December, December & January, January & February... and so on)

My resulting dataframe would ideally look something like this (using arbitrary numbers as the index):

month        index
201408       1
201409       1.0005      
201410       1.0152
201411       0.9997
201412       0.9551
201501       0.8985
201502       0.9754
201503       1.0045
201504       1.1520
201505       1.0148
201506       1.0452
201507       0.9945
201508       0.9751
201509       1.0004
201510       1.0415

When I have attempted this I end up matching products over the entire dataset and not over 2 consecutive months. I can do this by breaking the dataset down into numerous datasets for each month but this seems long and tedious. I am sure there is a quicker way to do this?

You can use this code below to create a test dataset:

month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
df1 <- data.frame(month, product_key, price)

To give an example of how I want this to work - here is what I did to create the index for August and September.

DF1Aug <- DF1 %>%
  filter(month %in% "201408") %>%
  group_by(product_key) %>%
  summarize(aveprice=mean(price))


DF1Sept <- DF1 %>%
  filter(month %in% "201409") %>%
  group_by(product_key) %>%
  summarize(aveprice=mean(price))


SeptPriceIndex <- transform(merge(DF1Aug, DF1Sept, by=c("product_key"), suffixes=c("_Aug", "_Sept"))) %>%
            mutate(AugAvgPrice=mean(aveprice_Aug)) %>%
            mutate(SeptAvgPrice=mean(aveprice_Sept)) %>%
            mutate(priceIndex = SeptAvgPrice/AugAvgPrice)

However, this is obviously a tedious process to do this for about the 20 months I have in the dataframe (and I need to do this on more than one dataframe) so I would like to find a way to automate it.

sbha sbha · Accepted Answer · 2018-04-04T18:34:15

Something like the following could work using dplyr and tidy (updated):

df %>% 
  # ensure data is sorted so that months are sequential by product key:
  arrange(product_key, month) %>% 
  # ensure every product month combo exists:
  complete(product_key, month) %>%  
  # create a sequential id within each product:
  group_by(product_key) %>% 
  mutate(grp_seq = row_number()) %>% 
  # remove product / month pairs without a price:
  filter(!is.na(price)) %>%
  # remove product keys that appear in only one month:
  filter(n_distinct(month) > 1) %>% 
  # remove non-consecutive product / month pairs:
  filter(lead(grp_seq) - 1 == grp_seq | lag(grp_seq) + 1 == grp_seq) %>% 
  # summarize the average price by month:
  group_by(month) %>% 
  summarize(avg_price = mean(as.numeric(price))) %>%
  # calculate the price index:
  mutate(index_price = avg_price / lag(avg_price)) 

# A tibble: 6 x 3
  month  avg_price index_price
  <chr>      <dbl>       <dbl>
1 201408      180.      NA    
2 201409      298.       1.65 
3 201410      403.       1.36 
4 201411      365.       0.905
5 201412      250.       0.685
6 201501      200.       0.800

Matching on a variable, according to dates, and calculating ratios

2 Answers

Example

Compute price index for several month

Data