2
votes

I have a dataframe, lets call it df1, that looks something like this:

month            product_key          price
201408           00020e32-a64715      75
201408           00020e32-a64715      75
201408           000340b8-bacac8      20
201408           000458f1-fdb6ae      45
201408           00083ebb-e9c17f      250
201408           00207e67-15a59f      480
201408           002777d7-50bec1      12
201408           002777d7-50bec1      12
201409           00020e32-a64715      75
201409           000340b8-bacac8      20
201409           00083ebb-e9c17f      250
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201410           00083ebb-e9c17f      250
201410           00207e67-15a59f      480
201410           00207e67-15a59f      480
201410           0020baff-9730f0      39.99
201411           00083ebb-e9c17f      250
201411           00207e67-15a59f      480
201412           00083ebb-e9c17f      250
201501           00083ebb-e9c17f      200
201501           0020baff-9730f0      29.99

There are other variables in the dataset but we don't need them for this purpose. My dataset is monthly data and ranges from mid 2014 to late 2015. For each month there are hundreds of products and there can be the same product multiple times within the month.

What I want to do is identify products that appear in both say, August and September and remove the products that don't appear in both months. Then I want to calculate an average of prices, of the remaining products, for each month. Then I want to divide the average September price by the average August price. In my dataframe this calculated figure would be the September index (August is defaulted to 1 as this is where the dataset begins).

Then I would like to do the same for all the following months, so I would like to identify products that appear in both September and October, removing products that don't appear in both months, and calculate the average price (of the remaining products) for each month. Then I want to divide the average October price by the average September price (which will be different to the previously calculated September average price as there will most likely be different products that appear in both September and October, compared with products that appear in both August and September). This calculated figure would be the October index. So I want to do this for all of the following months (October & November, November & December, December & January, January & February... and so on)

My resulting dataframe would ideally look something like this (using arbitrary numbers as the index):

month        index
201408       1
201409       1.0005      
201410       1.0152
201411       0.9997
201412       0.9551
201501       0.8985
201502       0.9754
201503       1.0045
201504       1.1520
201505       1.0148
201506       1.0452
201507       0.9945
201508       0.9751
201509       1.0004
201510       1.0415

When I have attempted this I end up matching products over the entire dataset and not over 2 consecutive months. I can do this by breaking the dataset down into numerous datasets for each month but this seems long and tedious. I am sure there is a quicker way to do this?

You can use this code below to create a test dataset:

month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
df1 <- data.frame(month, product_key, price)

To give an example of how I want this to work - here is what I did to create the index for August and September.

DF1Aug <- DF1 %>%
  filter(month %in% "201408") %>%
  group_by(product_key) %>%
  summarize(aveprice=mean(price))


DF1Sept <- DF1 %>%
  filter(month %in% "201409") %>%
  group_by(product_key) %>%
  summarize(aveprice=mean(price))


SeptPriceIndex <- transform(merge(DF1Aug, DF1Sept, by=c("product_key"), suffixes=c("_Aug", "_Sept"))) %>%
            mutate(AugAvgPrice=mean(aveprice_Aug)) %>%
            mutate(SeptAvgPrice=mean(aveprice_Sept)) %>%
            mutate(priceIndex = SeptAvgPrice/AugAvgPrice)

However, this is obviously a tedious process to do this for about the 20 months I have in the dataframe (and I need to do this on more than one dataframe) so I would like to find a way to automate it.

2

2 Answers

0
votes

Something like the following could work using dplyr and tidy (updated):

df %>% 
  # ensure data is sorted so that months are sequential by product key:
  arrange(product_key, month) %>% 
  # ensure every product month combo exists:
  complete(product_key, month) %>%  
  # create a sequential id within each product:
  group_by(product_key) %>% 
  mutate(grp_seq = row_number()) %>% 
  # remove product / month pairs without a price:
  filter(!is.na(price)) %>%
  # remove product keys that appear in only one month:
  filter(n_distinct(month) > 1) %>% 
  # remove non-consecutive product / month pairs:
  filter(lead(grp_seq) - 1 == grp_seq | lag(grp_seq) + 1 == grp_seq) %>% 
  # summarize the average price by month:
  group_by(month) %>% 
  summarize(avg_price = mean(as.numeric(price))) %>%
  # calculate the price index:
  mutate(index_price = avg_price / lag(avg_price)) 

# A tibble: 6 x 3
  month  avg_price index_price
  <chr>      <dbl>       <dbl>
1 201408      180.      NA    
2 201409      298.       1.65 
3 201410      403.       1.36 
4 201411      365.       0.905
5 201412      250.       0.685
6 201501      200.       0.800  
0
votes

The OP wants to get the price index for two subsequent months by computing an average of all recorded prices across all recurrent products and by dividing the average monthly prices.

It might be that this what the OP intends but I am not convinced that this is the correct approach:

  1. According to the OP there can be the same product multiple times within the month. So, if one product has more recorded prices than other products it will have a greater impact on the average monthly price and hence the price index.
  2. Products with higher prices will dominate the average monthly price. So, price changes of cheaper products will be less visible in the price index.

Example

Here is made-up example to explain what I mean. Let's assume we have two products. Product A is expensive and has two recorded prices in April but there is no price change in May. Product B is cheap but its price has doubled in May. So, my expectation is that the price index will reflect this increase.

library(data.table)
example <- fread(
  "month   product_key price
  201704   A           90
  201704   A           110
  201704   B           1
  201705   A           100
  201705   B           2")

# OP's approach
example[, .(avg_price = mean(price)), by = month][
  , price_index := avg_price / shift(avg_price)][]
    month avg_price price_index
1: 201704        67          NA
2: 201705        51    0.761194

So, according to OP's approach the price index has dropped.

I believe the correct approach is

  1. to compute the average monthly price for each product
  2. to compute the price index for each product in subsequent months
  3. to compute the average price index across products for each month

(I apologize for using data.table syntax which I am more acquainted with. I have tried to use dplyr syntax but it took me too much time.)

# compute average monthly price for each product
tmp1 <- example[, .(avg_price = mean(price)), keyby = .(product_key, month)]
tmp1
   product_key  month avg_price
1:           A 201704       100
2:           A 201705       100
3:           B 201704         1
4:           B 201705         2
# compute price index for each product
tmp2 <- tmp1[, price_index := avg_price / shift(avg_price), by = product_key][]
tmp2
   product_key  month avg_price price_index
1:           A 201704       100          NA
2:           A 201705       100           1
3:           B 201704         1          NA
4:           B 201705         2           2
# compute average price index
tmp2[, .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]
    month avg_price_index
1: 201704             NaN
2: 201705             1.5

Now, the price index shows an increase according to my expectations (which might not be the OP's).

Compute price index for several month

The OP has requested to compute the price index for several months but only for products which appear in subsequent months. This can be solved by a self join with shifted months.

Note that a simple lag() or shift() is dangerous here because it relies on row order and will fail if months are missing. Therefore, date arithmetic is used to find the correct subsequent month.

The sef join approach has the additional benefit that only recurrent products are considered. If a product_key has no match in the subsequent month, price will be NA. Those entries will be dropped when calculating the average price index.

library(data.table)
library(magrittr)
DF2 <- setDT(DF1)[
  # convert price from factor to numeric
  , price := price %>% as.character() %>% as.numeric()][
    # convert character month to Date
    , month := month %>% lubridate::ymd(truncated = 1L)][
      # compute average monthly price for each product
      , .(avg_price = mean(price)), keyby = .(product_key, month)]

# self join with subsequent month 
DF2[DF2[, .(product_key, month = month + months(1), avg_price)],
    on = .(product_key, month)][
      # compute price index for each product
      , price_index := avg_price / i.avg_price][
        # compute average price index
        , .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]
        month avg_price_index
1: 2014-09-01       0.8949772
2: 2014-10-01       1.0000000
3: 2014-11-01       1.0000000
4: 2014-12-01       1.0000000
5: 2015-01-01       0.8000000
6: 2015-02-01             NaN

Data

As provided by the OP

month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
DF1 <- data.frame(month, product_key, price)

Note that all columns are factors.