I have a dataframe, lets call it df1, that looks something like this:
month product_key price
201408 00020e32-a64715 75
201408 00020e32-a64715 75
201408 000340b8-bacac8 20
201408 000458f1-fdb6ae 45
201408 00083ebb-e9c17f 250
201408 00207e67-15a59f 480
201408 002777d7-50bec1 12
201408 002777d7-50bec1 12
201409 00020e32-a64715 75
201409 000340b8-bacac8 20
201409 00083ebb-e9c17f 250
201409 00207e67-15a59f 480
201409 00207e67-15a59f 480
201409 00207e67-15a59f 480
201410 00083ebb-e9c17f 250
201410 00207e67-15a59f 480
201410 00207e67-15a59f 480
201410 0020baff-9730f0 39.99
201411 00083ebb-e9c17f 250
201411 00207e67-15a59f 480
201412 00083ebb-e9c17f 250
201501 00083ebb-e9c17f 200
201501 0020baff-9730f0 29.99
There are other variables in the dataset but we don't need them for this purpose. My dataset is monthly data and ranges from mid 2014 to late 2015. For each month there are hundreds of products and there can be the same product multiple times within the month.
What I want to do is identify products that appear in both say, August and September and remove the products that don't appear in both months. Then I want to calculate an average of prices, of the remaining products, for each month. Then I want to divide the average September price by the average August price. In my dataframe this calculated figure would be the September index (August is defaulted to 1 as this is where the dataset begins).
Then I would like to do the same for all the following months, so I would like to identify products that appear in both September and October, removing products that don't appear in both months, and calculate the average price (of the remaining products) for each month. Then I want to divide the average October price by the average September price (which will be different to the previously calculated September average price as there will most likely be different products that appear in both September and October, compared with products that appear in both August and September). This calculated figure would be the October index. So I want to do this for all of the following months (October & November, November & December, December & January, January & February... and so on)
My resulting dataframe would ideally look something like this (using arbitrary numbers as the index):
month index
201408 1
201409 1.0005
201410 1.0152
201411 0.9997
201412 0.9551
201501 0.8985
201502 0.9754
201503 1.0045
201504 1.1520
201505 1.0148
201506 1.0452
201507 0.9945
201508 0.9751
201509 1.0004
201510 1.0415
When I have attempted this I end up matching products over the entire dataset and not over 2 consecutive months. I can do this by breaking the dataset down into numerous datasets for each month but this seems long and tedious. I am sure there is a quicker way to do this?
You can use this code below to create a test dataset:
month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
df1 <- data.frame(month, product_key, price)
To give an example of how I want this to work - here is what I did to create the index for August and September.
DF1Aug <- DF1 %>%
filter(month %in% "201408") %>%
group_by(product_key) %>%
summarize(aveprice=mean(price))
DF1Sept <- DF1 %>%
filter(month %in% "201409") %>%
group_by(product_key) %>%
summarize(aveprice=mean(price))
SeptPriceIndex <- transform(merge(DF1Aug, DF1Sept, by=c("product_key"), suffixes=c("_Aug", "_Sept"))) %>%
mutate(AugAvgPrice=mean(aveprice_Aug)) %>%
mutate(SeptAvgPrice=mean(aveprice_Sept)) %>%
mutate(priceIndex = SeptAvgPrice/AugAvgPrice)
However, this is obviously a tedious process to do this for about the 20 months I have in the dataframe (and I need to do this on more than one dataframe) so I would like to find a way to automate it.