I'm trying to get a better understanding of performance in for loops in R. I modified the example from Hadley's book here but I'm still confused.
I have the following set-up, where the for loop goes over several random columns:
set.seed(123)
df <- as.data.frame(matrix(runif(1e3), ncol = 10))
cols <- sample(names(df), 2)
tracemem(df)
I have a for loop that runs for every element of cols.
for (i in seq_along(cols)) {
  df[[cols[i]]] <- 3.2
}
Running it, tracemem reports the following copies:
tracemem[0x1c54040 -> 0x20e1470]:
tracemem[0x20e1470 -> 0x20e17b8]: [[<-.data.frame [[<-
tracemem[0x20e17b8 -> 0x20dc4b8]: [[<-.data.frame [[<-
tracemem[0x20dc4b8 -> 0x20dc800]:
tracemem[0x20dc800 -> 0x20dc8a8]: [[<-.data.frame [[<-
tracemem[0x20dc8a8 -> 0x20dcaa0]: [[<-.data.frame [[<-
Hadley notes in his example:
In fact, each iteration copies the data frame not once, not twice, but three times! Two copies are made by [[.data.frame, and a further copy is made because [[.data.frame is a regular function that increments the reference count of x.
Can someone explain why the [[<-.data.frame method needs to make two copies?
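For comparison, here is a minimal sketch of the same assignments done on a plain list, so that `[[<-` does not dispatch to `[[<-.data.frame`. The names `df_loop` and `lst` are my own, and the `capabilities("profmem")` guard is only there because tracemem needs an R build with memory profiling. I would expect fewer copies to be reported on the list, but I have not confirmed that this holds across R versions.

```r
set.seed(123)
df <- as.data.frame(matrix(runif(1e3), ncol = 10))
cols <- sample(names(df), 2)

# Reference result: the original data-frame loop, run on a separate binding.
df_loop <- df
for (i in seq_along(cols)) {
  df_loop[[cols[i]]] <- 3.2
}

# Same assignments on a plain list (hypothetical name `lst`), so `[[<-`
# uses the internal list method rather than `[[<-.data.frame`.
lst <- as.list(df)
if (capabilities("profmem")) tracemem(lst)  # tracing needs memory profiling support
for (i in seq_along(cols)) {
  # A list does not recycle to the row count, so fill the column explicitly.
  lst[[cols[i]]] <- rep(3.2, nrow(df))
}
if (capabilities("profmem")) untracemem(lst)

# Both routes should end up with the same columns.
all.equal(as.list(df_loop), lst)
```

If the list version reports fewer copies, that would suggest the extra ones come from the data frame method itself rather than from the for loop.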
class<- at the end, since it is being called on a NAM(2) object at that point. That answer also suggests that R has improved things over time, so you're seeing both shallow & deep copies, and so not every "copy" necessarily incurs as large a performance hit. – joran