I'm trying to get a better understanding of the performance of for loops in R. I modified the example from Hadley's book, but I'm still confused.

I have the following set-up, where the for loop goes over several random columns:

set.seed(123)
df <- as.data.frame(matrix(runif(1e3), ncol = 10))
cols <- sample(names(df), 2)
tracemem(df)

I have a for loop that runs for every element of cols.

for (i in seq_along(cols)) {
  df[[cols[i]]] <- 3.2
}

I get the following list of copies.

tracemem[0x1c54040 -> 0x20e1470]: 
tracemem[0x20e1470 -> 0x20e17b8]: [[<-.data.frame [[<- 
tracemem[0x20e17b8 -> 0x20dc4b8]: [[<-.data.frame [[<- 
tracemem[0x20dc4b8 -> 0x20dc800]: 
tracemem[0x20dc800 -> 0x20dc8a8]: [[<-.data.frame [[<- 
tracemem[0x20dc8a8 -> 0x20dcaa0]: [[<-.data.frame [[<- 

Hadley notes in his example:

In fact, each iteration copies the data frame not once, not twice, but three times! Two copies are made by [[.data.frame, and a further copy is made because [[.data.frame is a regular function that increments the reference count of x.

Can someone explain why the [[<-.data.frame method needs to make two copies?

I'm fairly sure the reason is the same as discussed in this answer; namely, you get one copy for changing the value and another copy from class<- at the end, since it is being called on a NAM(2) object at that point. That answer also suggests that R has improved things over time, so you're seeing both shallow and deep copies, and not every "copy" necessarily incurs as large a performance hit. - joran

1 Answer

This isn't really a complete answer to your question, but it's a start.

If you look in the R Language Definition, you'll see that df[["name"]] <- 3.2 is implemented as

`*tmp*` <- df
df <- "[[<-.data.frame"(`*tmp*`, "name", value=3.2)
rm(`*tmp*`)

So one copy gets put into `*tmp*`. If you call debug("[[<-.data.frame"), you'll see that the method really does get called with an argument named `*tmp*`, and tracemem() will show that the first duplication happens before the method is entered.
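To make the expansion concrete, here is a minimal sketch that performs the replacement by hand and compares it with the ordinary syntax. The small data frame and the names df1 and df2 are made up for illustration, and the sketch calls the `[[<-` generic (which dispatches to the data frame method) rather than the method directly.

```r
# Illustrative data; not the data frame from the question.
df <- data.frame(name = c(1, 2, 3), other = c(4, 5, 6))

# Ordinary replacement syntax.
df1 <- df
df1[["name"]] <- 3.2

# The same replacement written out as the three steps described above.
df2 <- df
`*tmp*` <- df2
df2 <- `[[<-`(`*tmp*`, "name", value = 3.2)  # generic dispatches to [[<-.data.frame
rm(`*tmp*`)

# Both routes should give the same data frame.
identical(df1, df2)
```

This only shows that the two forms are equivalent in result; it says nothing about how many duplications each one triggers, which depends on the R version.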

The function [[<-.data.frame is a regular function with a header like this:

function (x, i, j, value)  

That function gets called as

`[[<-.data.frame`(`*tmp*`, "name", value = 3.2)

Now there are three references to the data frame: df in the global environment, `*tmp*` in the internal code, and x in that function. (Actually, there's an intermediate step where the generic is called, but it is a primitive, so it doesn't need to make a new reference.)

The class of x gets changed in the function; that triggers a copy. Then one of the components of x is changed; that's another copy. So that makes 3.
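The effect of an extra reference held by a regular function can be seen in a small sketch. Here set_first_column is a hypothetical helper written for this illustration, not part of base R, and the optional tracemem() part assumes an R build with memory profiling enabled; what it prints, and how many duplications it reports, varies between R versions.

```r
# Hypothetical helper: a regular (closure) function that modifies its argument.
set_first_column <- function(x, value) {
  x[[1]] <- value  # changes the function's local x, not the caller's object
  x
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
out <- set_first_column(df, 3.2)

df$a   # still 1 2 3: the caller's data frame is untouched
out$a  # 3.2 3.2 3.2

# Optional: report duplications, if this R build supports memory profiling.
if (capabilities("profmem")) {
  tracemem(df)
  invisible(set_first_column(df, 3.2))
  untracemem(df)
}
```

Because the caller's df must stay unchanged, R has to duplicate at least part of the object somewhere inside the call; that is the cost being counted in the explanation above.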

Just guessing, I'd say the reason for the first duplication is that a complicated replacement might refer to the original value, and it's avoiding the possibility of retrieving a partially modified value.
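As a sketch of what a "complicated replacement" looks like, the R Language Definition expands a nested replacement such as names(x)[2] <- "B" so that the inner call reads from `*tmp*`. The vector and the names y and z below are made up for illustration; this shows the expansion reading the original value, but it does not by itself prove that this is why the first duplication exists.

```r
x <- c(a = 1, b = 2, c = 3)

# Ordinary nested replacement.
y <- x
names(y)[2] <- "B"

# The same replacement written out by hand: the inner `[<-` call
# reads names(`*tmp*`), i.e. the value as it was before modification.
z <- x
`*tmp*` <- z
z <- `names<-`(`*tmp*`, value = `[<-`(names(`*tmp*`), 2, value = "B"))
rm(`*tmp*`)

identical(y, z)
```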