0
votes

I'm trying to create two new variables using dplyr/tidyr functions by using data from previous but related rows. In my example, I have a data.frame with the data for ID, YEAR and AMOUNT (see below).

    ID   YEAR    AMOUNT
    A    2000    4
    B    2000    4
    A    2001    2
    B    2001    3
    A    2002    3
    B    2002    1

I want to create the last two columns dynamically (see below) using dplyr/tidyr (I found a way to do it regularly but I'd be interested to know how to do it in the latest version in dplyr/tidyr).

    ID   YEAR    AMOUNT    YtY_VARIANCE    Y_VARIANCE
    A    2000    4         0               0
    B    2000    4         0               0
    A    2001    2         -2              -2
    B    2001    3         -1              -1
    A    2002    3         +1              -1
    B    2002    1         -2              -3

Where YtY_VARIANCE is obtained by subtracting the amount from the previous year and Y_VARIANCE is obtained by subtracting the amount from the first year (always).

I would expect code like this:

    raw_data %>%
        group_by(ID, YEAR) # Not even sure if that's required?!?
        mutate(YtY_VARIANCE = xxx(), Y_VARIANCE = yyy() )

What should be xxx and yyy?! Is that even possible?

Anyone can tell or point me to an answer?

Thanks in advance!

2

2 Answers

2
votes

You could try

 raw_data %>% 
      group_by(ID) %>% 
      mutate(YtY_VARIANCE=AMOUNT-lag(AMOUNT),
       YtY_VARIANCE=replace(YtY_VARIANCE, which(is.na(YtY_VARIANCE)), 0), 
      Y_VARIANCE=AMOUNT-AMOUNT[1L])
 #      ID YEAR AMOUNT YtY_VARIANCE Y_VARIANCE
 #1  A 2000      4            0          0
 #2  B 2000      4            0          0
 #3  A 2001      2           -2         -2
 #4  B 2001      3           -1         -1
 #5  A 2002      3            1         -1
 #6  B 2002      1           -2         -3
-2
votes

Thanks Matthew and Akrun! With both your solutions, I was able to find the final solution.

Akrun's suggested solution worked within my simple model but it was still having issues within my bigger/complex/final model. The final solution I ended up with was to combine both:

    raw_data %>% 
      group_by(ID) %>%
      arrange(YEAR) %>%
      mutate(YtY_VARIANCE=AMOUNT-lag(AMOUNT),
             YtY_VARIANCE=replace(YtY_VARIANCE, which(is.na(YtY_VARIANCE)), 0), 
             Y_VARIANCE=AMOUNT-first(AMOUNT))

Thanks for your help! And hopefully, it will help others!