0
votes

I have a problem on creating a new variable Growth, which equals to percentage change in Population between closest years ending in “2” and “7”.

# dt
ID       Population      year
1                50      1995
1                60      1996
1                70      1997
1                80      1998
1                90      1999
1               100      2000
1               105      2001
1               110      2002
1               120      2003
1               130      2004
1               140      2005
1               150      2006
1               200      2007
1               300      2008

dt <- data.table::fread("ID       Population      year
1                50      1995
  1                60      1996
  1                70      1997
  1                80      1998
  1                90      1999
  1               100      2000
  1               105      2001
  1               110      2002
  1               120      2003
  1               130      2004
  1               140      2005
  1               150      2006
  1               200      2007
  1               300      2008", header = T)

Growth = Percentage change in Pop between closest years ending in “2” and “7”. For example, in the year:
1996: (1997 Pop – 1992 Pop) / 1992 Pop
1997: (2002 Pop – 1997 Pop) / 1997 Pop
1998: (2002 Pop – 1997 Pop) / 1997 Pop
1999: (2002 Pop – 1997 Pop) / 1997 Pop
2000: (2002 Pop – 1997 Pop) / 1997 Pop
2001: (2002 Pop – 1997 Pop) / 1997 Pop
2002: (2007 Pop – 2002 Pop) / 2002 Pop
2003: (2007 Pop – 2002 Pop) / 2002 Pop
2004: (2007 Pop – 2002 Pop) / 2002 Pop
2005: (2007 Pop – 2002 Pop) / 2002 Pop
2006: (2007 Pop – 2002 Pop) / 2002 Pop
2007: (2012 Pop – 2007 Pop) / 2007 Pop
2008: (2012 Pop – 2007 Pop) / 2007 Pop

However, when I manipulate the Growth, I need to do this by column ID. Moreover, the range of year is from 1970 to 2018, really wide range. How can I do this in data.table?

3
In your example, 1992 Pop doesn't exist, so how would you calculate Growth for 1996? 2012 Pop also doesn't existacylam

3 Answers

2
votes

Here is a possible data.table approach:

#calculate the 5-yearly percentage changes first by 
#i) first creating all combinations of ID and 5-yearly years
#2) then join with the original dataset 
#3) then leading the Population column and calculating Growth
pctChange <- dt[CJ(ID=ID, year=seq(1967, 2022, 5), unique=TRUE), 
    .(ID, year, Growth=(shift(Population, type="lead") - Population) / Population), 
    on=.(ID, year)]    

#then perform a rolling join (`roll=TRUE`; see ?data.table) and 
#then update the original dt with Growth by reference (i.e. `:=`)
dt[, Growth := pctChange[dt, Growth, on=.(ID, year), roll=TRUE]]
dt

output:

    ID Population year    Growth
 1:  1         50 1995        NA
 2:  1         60 1996        NA
 3:  1         70 1997 0.5714286
 4:  1         80 1998 0.5714286
 5:  1         90 1999 0.5714286
 6:  1        100 2000 0.5714286
 7:  1        105 2001 0.5714286
 8:  1        110 2002 0.8181818
 9:  1        120 2003 0.8181818
10:  1        130 2004 0.8181818
11:  1        140 2005 0.8181818
12:  1        150 2006 0.8181818
13:  1        200 2007        NA
14:  1        300 2008        NA

Point to note: rolling join appears not to work with update join

dt[pctChange, Growth := Growth, on=.(ID, year), roll=TRUE]
1
votes

Not a data.table solution, but here is how you might do it in the tidyverse which could give you some ideas. Basically calculate the year to compare to for each row using integer division %/%, and then join the table onto itself so we have the comparison value in each row. Then it is simple to calculate the growth using your formula.

library(tidyverse)
dt <- read_table("ID       Population      year
1                50      1995
1                60      1996
1                70      1997
1                80      1998
1                90      1999
1               100      2000
1               105      2001
1               110      2002
1               120      2003
1               130      2004
1               140      2005
1               150      2006
1               200      2007
1               300      2008")
dt %>%
  mutate(compare_year = 5 * year %/% 5 - 3) %>%
  left_join(dt, by = c("ID", "compare_year" = "year")) %>%
  mutate(growth = (Population.x - Population.y) / Population.y)
#> # A tibble: 14 x 6
#>       ID Population.x  year compare_year Population.y growth
#>    <int>        <int> <int>        <dbl>        <int>  <dbl>
#>  1     1           50  1995         1992           NA NA    
#>  2     1           60  1996         1992           NA NA    
#>  3     1           70  1997         1992           NA NA    
#>  4     1           80  1998         1992           NA NA    
#>  5     1           90  1999         1992           NA NA    
#>  6     1          100  2000         1997           70  0.429
#>  7     1          105  2001         1997           70  0.5  
#>  8     1          110  2002         1997           70  0.571
#>  9     1          120  2003         1997           70  0.714
#> 10     1          130  2004         1997           70  0.857
#> 11     1          140  2005         2002          110  0.273
#> 12     1          150  2006         2002          110  0.364
#> 13     1          200  2007         2002          110  0.818
#> 14     1          300  2008         2002          110  1.73

Created on 2018-09-19 by the reprex package (v0.2.0).

1
votes

Similar to @calum_you, but using nearest 5yr growth, as requested in OP.

Sample data

dt <- data.table::fread("ID       Population      year
1                50      1995
  1                60      1996
  1                70      1997
  1                80      1998
  1                90      1999
  1               100      2000
  1               105      2001
  1               110      2002
  1               120      2003
  1               130      2004
  1               140      2005
  1               150      2006
  1               200      2007
  1               300      2008", header = T) %>%
  as_data_frame()  

1) 5yr growth rates table

dt_5yr <- dt %>% 
  filter(year %% 10 %in% c(2,7)) %>%
  group_by(ID) %>%   # I assume there will be more IDs in full data
  mutate(growth_5yr = Population / lag(Population) - 1) %>%
  ungroup()

2) Join original table to 5yr growth rates

dt %>%
  mutate(join_yr = year + 5 - (year + 3) %% 5) %>%
  left_join(dt_5yr %>% select(join_yr = year, growth_5yr))

Output

   ID Population year join_yr growth_5yr
1   1         50 1995    1997         NA
2   1         60 1996    1997         NA
3   1         70 1997    2002  0.5714286
4   1         80 1998    2002  0.5714286
5   1         90 1999    2002  0.5714286
6   1        100 2000    2002  0.5714286
7   1        105 2001    2002  0.5714286
8   1        110 2002    2007  0.8181818
9   1        120 2003    2007  0.8181818
10  1        130 2004    2007  0.8181818
11  1        140 2005    2007  0.8181818
12  1        150 2006    2007  0.8181818
13  1        200 2007    2012         NA
14  1        300 2008    2012         NA