1
votes

Using this data.frame

DATA

df <- read.table(text = c("
SiteID  measured    modelled
site1   50  47
site2   28  30
site3   158 162
site4   247 243
site5   456 463
site6   573 564
site7   634 640"), ,header =T)

I want to create two new columns (measured_diff and modelled_diff). In these two new columns the value for site1 and site2 will be the same as the value in measured and modelled. However, for the rest of the sites the value will be similar to below

measured_diff for site3 = measured for site3 - sum(measured for site1 and site2)

measured_diff for site4 = measured for site4 - measured for site3

measured_diff for site5 = measured for site5 - measured for site4

measured_diff for site6 = measured for site6 - measured for site5

measured_diff for site7 = measured for site7 - measured for site6

and the same for modelled_diff

FINAL RESULT

It should be as below

#  SiteID measured modelled diff_measured diff_modelled
#1  site1       50       47            50            47
#2  site2       28       30            28            30
#3  site3      158      162            80            85
#4  site4      247      243            89            81
#5  site5      456      463           209           220
#6  site6      573      564           117           101
#7  site7      634      640            61            76

Any suggestions how to do this in R using dplyr?

2
I am confused. If I follow your logic, diff_measured for site 4 seems to be 11: 247 - (50 + 28 + 158) = 11. Am I missing something? - jazzurro
Yeah, it seems like it should be df %>% mutate_at(vars(-SiteID), funs(diff = . - lag(cumsum(.), default = 0))), but the numbers are off. - alistaire

2 Answers

7
votes

You can get a little fancy with funs, wherein . represents the vector you're operating on.

Using mutate_at with vars(-SiteID) will call whatever function you pass it on every variable except SiteID (thus the -, just like subsetting). To create the function to pass it, you could write your own before the pipe, but the *_at (and *_if and *_all) functions can take a funs helper function which makes it easy to create functions on the fly. If you name your function within funs like you would a list item (e.g. list(a = 1)), it will make a new version of each variable, using the name as a suffix. The function can be anything, as long as it returns a vector of the appropriate length, and can be constructed with ., which within funs represents the vector it's operating on—here the whole variable, but if grouped, the vector of the group. Thus, we can make a vector piecewise with c, where

  • the first two terms stay the same .[1:2],
  • the third subtracts the sum of the first two .[3] - sum(.[1:2]), and
  • the rest are passed to base::diff, which returns a vector one shorter than you pass it.

All together:

df %>% mutate_at(vars(-SiteID), 
                 funs(diff = c(.[1:2], .[3] - sum(.[1:2]), diff(.[-1:-2]))))
##   SiteID measured modelled measured_diff modelled_diff
## 1  site1       50       47            50            47
## 2  site2       28       30            28            30
## 3  site3      158      162            80            85
## 4  site4      247      243            89            81
## 5  site5      456      463           209           220
## 6  site6      573      564           117           101
## 7  site7      634      640            61            76
3
votes

Here is an option using data.table

library(data.table)
setDT(df)[ , paste0("diff_", names(df)[-1]) := lapply(.SD, function(x) 
       c(x[1:2], x[3]- sum(x[1:2]), na.omit(shift(x, type="lead")-
                    x)[-(1:2)]))  , .SDcols = -1]
df
#   SiteID measured modelled diff_measured diff_modelled
#1:  site1       50       47            50            47
#2:  site2       28       30            28            30
#3:  site3      158      162            80            85
#4:  site4      247      243            89            81
#5:  site5      456      463           209           220
#6:  site6      573      564           117           101
#7:  site7      634      640            61            76

Or with diff

setDT(df)[, aste0("diff_", names(df)[-1]) := 
  lapply(.SD, function(x) c(x[1:2], x[3]- sum(x[1:2]), tail(diff(x),-2))), .SDcols = -1]

Or using base R

df[paste0("diff_", names(df)[-1])] <- lapply(df[-1], function(x) 
                    c(x[(1:2)], x[3]- sum(x[1:2]), tail(diff(x), -2)))