Time series data consists of:
Product (categorical); ProductGroup (categorical); Country (categorical); YearSinceProductLaunch (numeric); SalesAtLaunchYear (numeric)
Only "SalesAtLaunchYear" data has some missing values which needs to be imputed.
For some products, there is complete data, i.e. sales data exists for the launch year 1,2 and up to now.
Some other products, however, contain missing sales data only for the early years since launch. Products have different age, therefore sometimes there are 2 years since launch that are missing, sometimes there are 10 years, which depends on the product.
I am interested to find a model in R that can impute the missing time series data gaps. I have tried MICE by setting the model for "SalesAtLaunchYear" as random forest, but I am still getting some very high values of sales especially at the beginning of the product's launch. I am ensuring that at Year 0, all sales are 0 to avoid negative values. The data frame has 20000 rows with 300 unique products.
testdf = tibble::tribble(
~Country, ~ProductGroup, ~Product, ~YearSinceProductLaunch, ~SalesAtLaunchYear,
"CA", "ProductGroup1", "Product1", 0L, 0,
"CA", "ProductGroup1", "Product1", 1L, NA,
"CA", "ProductGroup1", "Product1", 2L, NA,
"CA", "ProductGroup1", "Product1", 3L, NA,
"CA", "ProductGroup1", "Product1", 4L, NA,
"CA", "ProductGroup1", "Product1", 5L, 206034.9814,
"CA", "ProductGroup1", "Product1", 6L, 170143.2623,
"CA", "ProductGroup1", "Product1", 7L, 212541.9306,
"CA", "ProductGroup1", "Product1", 8L, 270663.199,
"CA", "ProductGroup1", "Product1", 9L, 736738.3755,
"CA", "ProductGroup1", "Product1", 10L, 2579723.981,
"CA", "ProductGroup1", "Product1", 11L, 4964319.496,
"CA", "ProductGroup1", "Product1", 12L, 6864985.16,
"CA", "ProductGroup1", "Product1", 13L, 8793292.386,
"CA", "ProductGroup1", "Product1", 14L, 11416033.38,
"IT", "ProductGroup2", "Product2", 0L, 0,
"IT", "ProductGroup2", "Product2", 1L, NA,
"IT", "ProductGroup2", "Product2", 2L, NA,
"IT", "ProductGroup2", "Product2", 3L, NA,
"IT", "ProductGroup2", "Product2", 4L, NA,
"IT", "ProductGroup2", "Product2", 5L, NA,
"IT", "ProductGroup2", "Product2", 6L, NA,
"IT", "ProductGroup2", "Product2", 7L, NA,
"IT", "ProductGroup2", "Product2", 8L, NA,
"IT", "ProductGroup2", "Product2", 9L, NA,
"IT", "ProductGroup2", "Product2", 10L, NA,
"IT", "ProductGroup2", "Product2", 11L, NA,
"IT", "ProductGroup2", "Product2", 12L, NA,
"IT", "ProductGroup2", "Product2", 13L, 30806222.96,
"IT", "ProductGroup2", "Product2", 14L, 31456272,
"IT", "ProductGroup2", "Product2", 15L, 31853476.78,
"IT", "ProductGroup2", "Product2", 16L, 30379818,
"IT", "ProductGroup2", "Product2", 17L, 29765448.87,
"IT", "ProductGroup2", "Product2", 18L, 31376234,
"IT", "ProductGroup2", "Product2", 19L, 32628514.81,
"IT", "ProductGroup2", "Product2", 20L, 32732196,
"IT", "ProductGroup2", "Product2", 21L, 33503784.25,
"IT", "ProductGroup2", "Product2", 22L, 35163372,
"DE", "ProductGroup3", "Product3", 0L, 0,
"DE", "ProductGroup3", "Product3", 1L, 161884.081,
"DE", "ProductGroup3", "Product3", 2L, 7876925.474,
"DE", "ProductGroup3", "Product3", 3L, 12948209.55,
"DE", "ProductGroup3", "Product3", 4L, 13304401.76
)
testdf$Country = as.factor(testdf$Country)
testdf$ProductGroup = as.factor(testdf$ProductGroup)
testdf$Product = as.factor(testdf$Product)
str()
,head()
or screenshot)? You can use thereprex
anddatapasta
packages to assist you with that. See also Help me Help you & How to make a great R reproducible example? – Tung