0
votes

I have a dataframe with 6 variables:

Depr is a factor with 6 levels ("0", "1", "2", "3", "4", "5")

Sex is a factor with 3 levels ("Both sexes", "Female", "Male")

Age is a factor with 19 levels ("00-04", "05-09", "10-14", "15-19", "20-24", "25- 29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85+","Total")

GL is a factor (geographical level) with 5 levels ("HPE","KFLA","LGL","ON","Regional")

YR is an integer (year), there are only two - 2011 and 2016 (census years)

And Pop is population count, an integer.

The dataframe is set up in long format where I have population counts for all factor combinations for each of the two years.

Depr     Sex           Age      GL       YR        Pop
0        Both sexes   00-04     ON       2011      395     
0        Both sexes   00-04     ON       2016      5550
...
1        Both sexes   00-04     ON       2011      495
1        Both sexes   00-04     ON       2016      3923

I want to interpolate for the years in between 2011 and 2016 (2012, 2013, 2014, 2015) for each row in my dataframe so that I get something like this:

Depr     Sex           Age      GL       YR        Pop
0        Both sexes   00-04     ON       2011      395     
0        Both sexes   00-04     ON       2012      456
0        Both sexes   00-04     ON       2013      689
0        Both sexes   00-04     ON       2014      2354
0        Both sexes   00-04     ON       2015      3446
0        Both sexes   00-04     ON       2016      5550

I have set up nested loops and am using approx to do the linear interpolation.

#create an empty dataframe to combine the results

fdepr <- data.frame(Depr = factor (levels = c("0", "1", "2", "3", "4", "5")), 
                    Sex = factor(levels = c("Both sexes", "Female", "Male")), 
                    Age = factor (levels = c("00-04", "05-09", "10-14", 
                    "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", 
                    "45-49","50-54", "55-59", "60-64", "65-69", "70-74", "75- 
                     79", "80-84", "85+","Total")),
                    GL = factor(levels = c("HPE","KFLA","LGL","ON","Regional")), 
                    YR = integer(), 
                    Pop = integer())


#loops to subset Pop by grouping categories (depr is my original df)

for (i in unique(depr$Depr))
{ 
  for (j in unique(depr$Sex)) 
  {
    for (k in unique(depr$Age)) 
    {
      for (l in unique(depr$GL))  {
      temp <- subset(depr, subset=(Depr==i & Sex==j & Age==k & GL == l),select = c(YR, Pop))
      x <- temp$YR
      y <- temp$Pop
      t <- c(2011,2012,2013,2014,2015,2016)
      points <- approx(x,y, method = 'linear', xout=t)
      results <- data.frame(Depr=rep(i,6), Sex=rep(j,6), Age=rep(k,6), GL= rep(l,6), YR = points$x, Pop = points$y)
      fdepr <- rbind (fdepr,results)
    } 
  }}} 

It seems to go through and do the first round fine and populate results and fdepr as expected, but then I get Error in approx(x, y, method = "linear", xout = t) : need at least two non-NA values to interpolate

temp is empty and so are x and y. I'm not sure if it's something in the way fdepr is defined or if it's the nested loops that are the problem...

I'm not a data scientist so complex logic and programming is not intuitive - any insight is appreciated

1
Can you share depr or some other example input, as well as the desired output? - IceCreamToucan
depr has 2964 rows, I'm setting up an empty dataframe to put my interpolated results into - MegPophealth
I've just added something, but I can't put all 2964 combinations, unless I can share the entire dataset here? - MegPophealth

1 Answers

1
votes

In my opinion creating a new data frame and doing nested for-loops is making this more complicated than it needs to be.

Here I'm using group_by and expand to get the intermediate years for each of the data groups, then left_joining the original data frame to add the corresponding Pop values. After that, you just need to apply na.approx to each group of the data, and the data is already grouped from the expand part so you can just use mutate.

Of course, you could overwrite Pop in the mutate call instead of creating a new variable, I just did that for illustrative purposes.

library(zoo) # for na.approx
library(tidyverse) # for $>%, group_by, expand, left_join, and mutate

depr %>% 
  group_by(Depr, Sex, Age, GL) %>% 
  expand(YR = do.call(seq, as.list(YR))) %>% 
  left_join(depr, names(.)) %>% 
  mutate(Pop_interp = na.approx(Pop))

# # A tibble: 12 x 7
# # Groups:   Depr, Sex, Age, GL [2]
#     Depr Sex   Age   GL       YR   Pop Pop_interp
#    <int> <chr> <chr> <chr> <int> <int>      <dbl>
#  1     0 Both  00-04 ON     2011   395       395 
#  2     0 Both  00-04 ON     2012    NA      1426 
#  3     0 Both  00-04 ON     2013    NA      2457 
#  4     0 Both  00-04 ON     2014    NA      3488 
#  5     0 Both  00-04 ON     2015    NA      4519 
#  6     0 Both  00-04 ON     2016  5550      5550 
#  7     1 Both  00-04 ON     2011   495       495 
#  8     1 Both  00-04 ON     2012    NA      1181.
#  9     1 Both  00-04 ON     2013    NA      1866.
# 10     1 Both  00-04 ON     2014    NA      2552.
# 11     1 Both  00-04 ON     2015    NA      3237.
# 12     1 Both  00-04 ON     2016  3923      3923 

Here's the same thing with data.table and magrittr instead of tidyverse

library(zoo)
library(magrittr)
library(data.table)

depr[, .(YR = do.call(seq, as.list(YR))), .(Depr, Sex, Age, GL)] %>% 
  .[depr, on = names(.), Pop := i.Pop] %>% 
  .[, Pop_Interp := na.approx(Pop)] %>% 
  print

#     Depr  Sex   Age GL   YR  Pop Pop_Interp
#  1:    0 Both 00-04 ON 2011  395      395.0
#  2:    0 Both 00-04 ON 2012   NA     1426.0
#  3:    0 Both 00-04 ON 2013   NA     2457.0
#  4:    0 Both 00-04 ON 2014   NA     3488.0
#  5:    0 Both 00-04 ON 2015   NA     4519.0
#  6:    0 Both 00-04 ON 2016 5550     5550.0
#  7:    1 Both 00-04 ON 2011  495      495.0
#  8:    1 Both 00-04 ON 2012   NA     1180.6
#  9:    1 Both 00-04 ON 2013   NA     1866.2
# 10:    1 Both 00-04 ON 2014   NA     2551.8
# 11:    1 Both 00-04 ON 2015   NA     3237.4
# 12:    1 Both 00-04 ON 2016 3923     3923.0

Input data used

depr <- data.table::fread('
Depr     Sex           Age      GL       YR        Pop
0        Both    00-04     ON       2011      395     
0        Both    00-04     ON       2016      5550
1        Both    00-04     ON       2011      495     
1        Both    00-04     ON       2016      3923
', data.table = F)