1
votes

I have a data frame that look like this:

z <- data.frame(ent = c(1, 1, 1, 2, 2, 2, 3, 3, 3), year = c(1995, 2000, 2005, 1995, 2000, 2005, 1995, 2000, 2005), pobtot = c(50, 60, 70, 10, 4, 1, 100, 105, 110))

As you can see, there is a gap between 5 years for every "ent". I want to interpolate data to every missing year: 1996, 1997, 1998, 1999, 2001, 2002, 2003, 2004 and also prognosticate to 2006, 2007 and 2008. Is there a way to do this?

Any help would be appreciated.

2
What type of modeling assumption to you want to make to fill in missing values? Do you just want a linear interpolation? And then what assumptions to you want to make for future values? There are really statistical modeling questions, not R programming questions.MrFlick
What is your expected output for the data shared?Ronak Shah
Actually I got 2 answers and both are usefulNeuxis

2 Answers

2
votes

We can use complete to expand the data for each 'ent' and the 'year' range, then with na.approx interpolate the missing values in 'pobtot'

library(dplyr)
library(tidyr)
z %>% 
   complete(ent, year = 1995:2008) %>% 
   mutate(pobtot = zoo::na.approx(pobtot, na.rm = FALSE))
2
votes

Assuming you want linear interpolation, R uses approx() for such things by default, e.g. for drawing lines in a plot. We may also use that function to interpolate the years. It doesn't extrapolate, though, but we could use forecast::ets() with default settings for this which calculates an exponential smoothing state space model. Note, however, that this may also produce negative values, but OP hasn't stated what is needed in such a case. So anyway in a by() approach we could do:

library(forecast)
p <- 3  ## define number of years for prediction

res <- do.call(rbind, by(z, z$ent, function(x) {
  yseq <- min(x$year):(max(x$year) + p)  ## sequence of years + piction
  a <- approx(x$year, x$pobtot, head(yseq, -p))$y  ## linear interpolation
  f <- predict(ets(a), 3) ## predict `p` years
  r <- c(a, f$mean)  ## combine interpolation and prediction
  data.frame(ent=x$ent[1], year=yseq, pobtot=r)  ## output as data frame
}))

Result

res
#      ent year pobtot
# 1.1    1 1995   50.0
# 1.2    1 1996   52.0
# 1.3    1 1997   54.0
# 1.4    1 1998   56.0
# 1.5    1 1999   58.0
# 1.6    1 2000   60.0
# 1.7    1 2001   62.0
# 1.8    1 2002   64.0
# 1.9    1 2003   66.0
# 1.10   1 2004   68.0
# 1.11   1 2005   70.0
# 1.12   1 2006   72.0
# 1.13   1 2007   74.0
# 1.14   1 2008   76.0
# 2.1    2 1995   10.0
# 2.2    2 1996    8.8
# 2.3    2 1997    7.6
# 2.4    2 1998    6.4
# 2.5    2 1999    5.2
# 2.6    2 2000    4.0
# 2.7    2 2001    3.4
# 2.8    2 2002    2.8
# 2.9    2 2003    2.2
# 2.10   2 2004    1.6
# 2.11   2 2005    1.0
# 2.12   2 2006    0.4
# 2.13   2 2007   -0.2
# 2.14   2 2008   -0.8
# 3.1    3 1995  100.0
# 3.2    3 1996  101.0
# 3.3    3 1997  102.0
# 3.4    3 1998  103.0
# 3.5    3 1999  104.0
# 3.6    3 2000  105.0
# 3.7    3 2001  106.0
# 3.8    3 2002  107.0
# 3.9    3 2003  108.0
# 3.10   3 2004  109.0
# 3.11   3 2005  110.0
# 3.12   3 2006  111.0
# 3.13   3 2007  112.0
# 3.14   3 2008  113.0

We could quickly check this in a plot, which, apart from the negative values of entity 2 looks quite reasonable.

with(res, plot(year, pobtot, type='n', main='z'))
with(res[res$year < 2006, ], points(year, pobtot, pch=20, col=3))
with(res[res$year > 2005, ], points(year, pobtot, pch=20, col=4))
with(res[res$year %in% z$year, ], points(year, pobtot, pch=20, col=1))
abline(h=0, lty=3)
legend(2005.25, 50, c('measurem.', 'interpol.', 'extrapol.'), pch=20,
       col=c(1, 3, 4), cex=.8, bty='n')

![enter image description here