
I have population data for the years 1966, 1971, and 1976. I would like to create another tidy data frame that includes the missing years (i.e., 1967, 1968, 1969, 1970, 1972, 1973, 1974, 1975). Linear interpolation is fine.

I'm thinking approx or approxfun is involved but I am not sure how.

library("tidyverse")

df <- tibble(
  year = c(1976, 1971, 1966, 1976, 1971, 1966, 1976, 1971, 1966),
  age_min = c(rep(0, 3), rep(5, 3), rep(10, 3)),
  population = c(180, 200, 300, 150, 250, 450, 25, 50, 150)
)

p1971 <- filter(df, year == 1971)
p1976 <- filter(df, year == 1976)

ggplot(data = p1971, aes(x = age_min, y = population)) +
  # 1971 population distribution
  geom_point(size = 4, color = "red") +
  geom_line(color = "red", size = 1.5) +
  geom_text(aes(label = year), nudge_y = 9) +

  # 1976 population distribution
  geom_point(data = p1976, aes(x = age_min, y = population), size = 3, color = "blue") +
  geom_line(data = p1976, aes(x = age_min, y = population), color = "blue", size = 1.5) +
  geom_text(data = p1976, aes(x = age_min, y = population, label = year), nudge_y = -9)

1 Answer


According to ?na.approx (from zoo):

"Missing values (NAs) are replaced by linear interpolation via approx or cubic spline interpolation via spline, respectively."

So we could use na.approx after creating rows for the missing 'year' values with complete:

library(dplyr)
library(tidyr)
library(zoo)
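df %>%
   # add a row (with NA population) for every year in 1966:1976, per age_min
   complete(age_min, year = 1966:1976) %>%
   group_by(age_min) %>%
   # fill the NAs by linear interpolation within each age group
   mutate(population = na.approx(population, na.rm = FALSE)) %>%
   ungroup()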
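
Since the question mentions approx, here is a rough sketch of how the same idea might look with base R's approx instead of zoo. It assumes a recent R version, where approx drops the NA pairs by default before interpolating, and the name `interpolated` is just an illustrative choice. The example data from the question is repeated so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- tibble(
  year = c(1976, 1971, 1966, 1976, 1971, 1966, 1976, 1971, 1966),
  age_min = c(rep(0, 3), rep(5, 3), rep(10, 3)),
  population = c(180, 200, 300, 150, 250, 450, 25, 50, 150)
)

interpolated <- df %>%
  # create the missing years for each age group (population is NA there)
  complete(age_min, year = 1966:1976) %>%
  group_by(age_min) %>%
  # approx() interpolates linearly between the known (year, population)
  # points and evaluates the result at every year in the group
  mutate(population = approx(year, population, xout = year)$y) %>%
  ungroup()

interpolated
```

Unlike na.approx, which works on the position of the values in the vector, this version interpolates against the actual year values, so it should also behave sensibly if the known years are unevenly spaced.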