percentage change between closest years

Question

I am having trouble creating a new variable, Growth, equal to the percentage change in Population between the closest years ending in “2” and “7”.

# dt
ID       Population      year
1                50      1995
1                60      1996
1                70      1997
1                80      1998
1                90      1999
1               100      2000
1               105      2001
1               110      2002
1               120      2003
1               130      2004
1               140      2005
1               150      2006
1               200      2007
1               300      2008

dt <- data.table::fread("ID       Population      year
1                50      1995
  1                60      1996
  1                70      1997
  1                80      1998
  1                90      1999
  1               100      2000
  1               105      2001
  1               110      2002
  1               120      2003
  1               130      2004
  1               140      2005
  1               150      2006
  1               200      2007
  1               300      2008", header = TRUE)

Growth = Percentage change in Pop between closest years ending in “2” and “7”. For example, in the year:
1996: (1997 Pop – 1992 Pop) / 1992 Pop
1997: (2002 Pop – 1997 Pop) / 1997 Pop
1998: (2002 Pop – 1997 Pop) / 1997 Pop
1999: (2002 Pop – 1997 Pop) / 1997 Pop
2000: (2002 Pop – 1997 Pop) / 1997 Pop
2001: (2002 Pop – 1997 Pop) / 1997 Pop
2002: (2007 Pop – 2002 Pop) / 2002 Pop
2003: (2007 Pop – 2002 Pop) / 2002 Pop
2004: (2007 Pop – 2002 Pop) / 2002 Pop
2005: (2007 Pop – 2002 Pop) / 2002 Pop
2006: (2007 Pop – 2002 Pop) / 2002 Pop
2007: (2012 Pop – 2007 Pop) / 2007 Pop
2008: (2012 Pop – 2007 Pop) / 2007 Pop

However, I need to compute Growth separately for each ID. Moreover, the years range from 1970 to 2018, which is a wide range. How can I do this in data.table?

In your example, 1992 Pop doesn't exist, so how would you calculate Growth for 1996? 2012 Pop also doesn't exist — acylam

chinsoon12 · Accepted Answer · 2018-09-20T00:28:24

Here is a possible data.table approach:

#calculate the 5-yearly percentage changes first by
#1) creating all combinations of ID and 5-yearly years (those ending in 2 or 7)
#2) joining these with the original dataset
#3) leading the Population column and calculating Growth
pctChange <- dt[CJ(ID=ID, year=seq(1967, 2022, 5), unique=TRUE), 
    .(ID, year, Growth=(shift(Population, type="lead") - Population) / Population), 
    on=.(ID, year)]    

#then perform a rolling join (`roll=TRUE`; see ?data.table) and 
#then update the original dt with Growth by reference (i.e. `:=`)
dt[, Growth := pctChange[dt, Growth, on=.(ID, year), roll=TRUE]]
dt

output:

    ID Population year    Growth
 1:  1         50 1995        NA
 2:  1         60 1996        NA
 3:  1         70 1997 0.5714286
 4:  1         80 1998 0.5714286
 5:  1         90 1999 0.5714286
 6:  1        100 2000 0.5714286
 7:  1        105 2001 0.5714286
 8:  1        110 2002 0.8181818
 9:  1        120 2003 0.8181818
10:  1        130 2004 0.8181818
11:  1        140 2005 0.8181818
12:  1        150 2006 0.8181818
13:  1        200 2007        NA
14:  1        300 2008        NA
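As a sanity check on the output above, here is a minimal base R sketch of the arithmetic. The helper `window_start` is a hypothetical name introduced only for illustration; it maps a year to the most recent year ending in 2 or 7, which is the row the rolling join picks up.

```r
# Hypothetical helper: start of the 5-year window (a year ending in 2 or 7)
# that contains `year`.
window_start <- function(year) 5 * floor((year - 2) / 5) + 2

window_start(c(1996, 1997, 2001, 2002, 2006, 2007))
# 1992 1997 1997 2002 2002 2007

# Growth for the two windows fully covered by the sample data
pop <- c(`1997` = 70, `2002` = 110, `2007` = 200)
growth_1997 <- (pop[["2002"]] - pop[["1997"]]) / pop[["1997"]]  # 40 / 70
growth_2002 <- (pop[["2007"]] - pop[["2002"]]) / pop[["2002"]]  # 90 / 110
round(c(growth_1997, growth_2002), 7)
# 0.5714286 0.8181818
```

The windows starting in 1992 and 2007 give NA because the sample data has no Population for 1992 or 2012.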

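Point to note: a rolling join does not appear to work as an update join, presumably because in the form below the rows of pctChange are rolled onto dt rather than the other way round. The following does not fill in Growth for every year: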

dt[pctChange, Growth := Growth, on=.(ID, year), roll=TRUE]
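If an update join by reference is preferred, one possible alternative (a sketch, not part of the approach above) is to compute the window start arithmetically and then use an ordinary, non-rolling update join. The column names `win`, `pop_start` and `pop_end` are assumptions made for this example, and the sample data is rebuilt so the block runs on its own.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  ID = 1L,
  Population = c(50, 60, 70, 80, 90, 100, 105, 110, 120, 130, 140, 150, 200, 300),
  year = 1995:2008
)

# Hypothetical helper column: most recent year ending in 2 or 7
dt[, win := 5L * ((year - 2L) %/% 5L) + 2L]

# Population at the start of each window, and at the start of the next
# window (shifted back 5 years so both share the same `win` key)
starts <- dt[year == win, .(ID, win, pop_start = Population)]
ends   <- dt[year == win, .(ID, win = win - 5L, pop_end = Population)]

# Inner join keeps only windows where both endpoints are observed
growth <- merge(starts, ends, by = c("ID", "win"))
growth[, Growth := (pop_end - pop_start) / pop_start]

# Ordinary update join by reference; unmatched rows are left as NA
dt[growth, Growth := i.Growth, on = .(ID, win)]
dt[, win := NULL]
dt
```

Because every row carries its own `win` key, an exact match on `ID` and `win` is enough and no rolling is needed. On the sample data this should give the same Growth column as the rolling join version.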
