In the book Software for Data Analysis: Programming with R, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. Conversely, writing good scripts using data.table objects should specifically avoid the use of object assignment with <-, typically used to store the result of a function.
First is a technical question. Imagine an R function called proc1 that accepts a data.table object x as its argument (in addition to, maybe, other parameters). proc1 returns NULL but modifies x using :=. From what I understand, calling proc1(x=x1) makes a copy of x1 just because of the way that promises work. However, as demonstrated below, the original object x1 is still modified by proc1. Why/how is this?
> require(data.table)
> x1 <- CJ(1:2, 2:3)
> x1
   V1 V2
1:  1  2
2:  1  3
3:  2  2
4:  2  3
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> proc1(x1)
NULL
> x1
   V1 V2 y
1:  1  2 2
2:  1  3 3
3:  2  2 4
4:  2  3 6
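One way to probe the "why/how" question is to compare memory addresses before and inside the call. The following is a minimal sketch, assuming data.table's address() function is available; show_address is a hypothetical helper introduced only for illustration, and the table is built with explicit column names so the example does not depend on how CJ names its columns.

```r
library(data.table)

# Same values as the CJ(1:2, 2:3) table above, with explicit column names
x1 <- data.table(V1 = c(1L, 1L, 2L, 2L), V2 = c(2L, 3L, 2L, 3L))

# Hypothetical helper: reports the address of the object its argument refers to
show_address <- function(x) address(x)

addr_outside <- address(x1)
addr_inside  <- show_address(x1)

# If the call had copied x1, these two addresses would differ
same_object <- identical(addr_outside, addr_inside)

proc1 <- function(x) {
  x[, y := V1 * V2]   # := adds the column by reference
  NULL
}
proc1(x1)

# The caller's table now carries the new column
has_y <- "y" %in% names(x1)
```

If same_object comes back TRUE, the argument inside the function refers to the very same table as x1, which would be consistent with := changing the caller's object.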
Furthermore, it seems that using proc1(x=x1) isn't any slower than doing the procedure directly on x1, indicating that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:
> x1 <- CJ(1:2000, 1:500)
> x1[, paste0("V",3:300) := rnorm(1:nrow(x1))]
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> system.time(proc1(x1))
user system elapsed
0.00 0.02 0.02
> x1 <- CJ(1:2000, 1:500)
> system.time(x1[,y:= V1*V2])
user system elapsed
0.03 0.00 0.03
So, given that passing a data.table argument to a function doesn't add time, it is possible to write procedures for data.table objects that combine the speed of data.table with the generalizability of a function. However, given what John Chambers said, that functions should not have side effects, is it really "ok" to do this type of procedural programming in R? Why was he arguing that side effects are "bad"? If I'm going to ignore his advice, what sort of pitfalls should I be aware of? What can I do to write "good" data.table procedures?
data.table deliberately departs from R's copy-on-write. data.table isn't copy-on-write, even within functions. If you really want to copy a 20GB data.table, you need to place x=copy(x) at the start of the function, or write x=copy(x)[,y:=V1*V2] inside the function. – Matt Dowle
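The copy() suggestion in the comment above can be sketched as follows. This is only an illustration under stated assumptions: proc_copy is a hypothetical name, and the small table stands in for real data.

```r
library(data.table)

x1 <- data.table(V1 = c(1L, 1L, 2L, 2L), V2 = c(2L, 3L, 2L, 3L))

# Hypothetical side-effect-free variant of proc1
proc_copy <- function(x) {
  x <- copy(x)        # explicit deep copy; the caller's table is left alone
  x[, y := V1 * V2]   # := now modifies only the local copy
  x[]                 # return the modified copy
}

x2 <- proc_copy(x1)

names(x1)   # the original table keeps its two columns
names(x2)   # the returned table has the extra column y
```

The trade-off is the one the comment points at: the copy costs time and memory in proportion to the table's size, so it is probably worth making only when the caller's object really must stay untouched.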