19 votes

What I really like about data.table is the := idiom for changing the table by reference, without the need for costly copies. From what I understand, this is one of the aspects which makes data.table so ultrafast compared to other methods.

Now, I started playing around with the dplyr package which seems to be equally performant. But since results still have to be assigned using the <- operator, I was expecting a performance drain at this level. However, there seems to be none.

As an example:

library(dplyr)
library(data.table)
library(Lahman)
library(microbenchmark)
library(ggplot2)

df <- Batting[ c("yearID", "teamID", "G_batting") ]

mb <- microbenchmark(
  dplyr = {
    tb <- tbl_df( df )
    tb <- tb %.%   # %.% was dplyr 0.1.x's chain operator (later replaced by %>%)
      group_by( yearID, teamID ) %.%
      mutate( G_batting = max(G_batting) )
  },
  data.table = {
    dt <- as.data.table( df )
    dt[ , G_batting := max(G_batting), by = list( yearID, teamID ) ]
  },
  times = 500
)

qplot( data = mb, x = expr, y = time * 1E-6, geom = "boxplot", ylab="time [ms]", xlab = "approach" )

[boxplot: benchmark timings in ms for the dplyr and data.table approaches, roughly comparable]

I am just wondering how this is possible? Or is there a conceptual mistake in the way I benchmark? Is my understanding of <- wrong?

Not to say that this is necessarily the case here, but using the .Call C API one can potentially alter any object in place. The API does not force the developer to return a new object, and all the data of the object passed in is available to the developer via C pointers of the SEXP structure. I've done it myself for fast in-place manipulation of image data (and no, it is not advisable!). – Oleg Sklyar

For whatever it's worth, for me dplyr was 50% slower than data.table in all iterations of your benchmark (100 times on microbenchmark, though on an earlier run I got a worse result for the max case). This is on a 64-bit Windows machine. – BrodieG

@BrodieG Interesting. For completeness, these benchmarks were carried out under R 3.0.2 on a Mac (x86_64-apple-darwin10.8.0, 64-bit), with dplyr version 0.1.1 (2014-02-09) and data.table 1.8.10. – Beasterfield

I'm also on 3.0.2 / 0.1.1 / 1.8.10. data.table timings: 10.52806 10.91406 11.51819 11.91552 14.73834; dplyr: 15.69537 16.29676 16.71768 17.43426 24.86194 (min, lq, median, uq, max; milliseconds). – BrodieG

The dataset needs to be much bigger and the replications much smaller; otherwise you're comparing the overhead, not the task itself. – Matt Dowle

2 Answers

22 votes

Nice question. In general, I'd benchmark on a data size that's big enough to not fit (almost) entirely in cache. Have a look here under "initial setup". It really isn't meaningful to compare tools developed for (in-memory) big data on tasks that run in milliseconds. We are planning to benchmark on relatively bigger data in the future.

Additionally, if your intent is to find out whether mutate performs a copy, all you have to do is check the address before and after (using .Internal(inspect(.)) in base R, or the changes() function in dplyr).
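For example, a minimal sketch of that address check using data.table's exported address() helper (an assumption: address() is available in your data.table version; tracemem() in base R is an alternative):

```r
library(data.table)

dt <- data.table(x = 1:5)
before <- address(dt$x)  # memory address of column x's vector
dt[, y := 1L]            # add a column by reference
after  <- address(dt$x)

identical(before, after)  # TRUE: x itself was never copied
```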


On to whether a copy is being made or not:

There are two different things to be checked here. A) creating a new column, and B) modifying an existing column.

A) Creating a new column:

require(dplyr)
require(data.table)
df <- tbl_df(data.frame(x=1:5, y=6:10))

df2 <- mutate(df, z=1L)
changes(df, df2)
# Changed variables:
#           old new
# z             0x105ec36d0

It tells you that the addresses of x and y are unchanged, and points to the newly added column z. What's happening here?

dplyr shallow-copies the data.frame and then adds the new column. A shallow copy, as opposed to a deep copy, copies only the vector of column pointers, not the data itself, and is therefore fast. In effect, df2 is created with 3 columns: the first two point to the same address locations as df's columns, and the 3rd column has just been created.
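As a sketch of what a shallow copy shares, here is the same idea in base R (assumptions: R >= 3.1.0, where duplicating a list no longer deep-copies its elements, and data.table's address() helper for inspecting pointers):

```r
library(data.table)  # only for address()

df  <- data.frame(x = 1:5, y = 6:10)
df2 <- df
df2$z <- 1L  # duplicates only the vector of column pointers, then adds z

identical(address(df$x), address(df2$x))  # TRUE: x's data is shared
identical(address(df),   address(df2))   # FALSE: the pointer vectors differ
```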

On the other hand, data.table doesn't have to shallow copy, as it modifies the column by reference (in-place). data.table also (cleverly) over-allocates a list of column vectors that allows for fast adding of (new) columns by reference.
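The over-allocation can be inspected directly with truelength() (a small sketch; the default spare capacity is governed by options("datatable.alloccol")):

```r
library(data.table)

dt <- data.table(x = 1:5, y = 6:10)
length(dt)                   # 2 columns in use
truelength(dt) > length(dt)  # TRUE: spare column slots were pre-allocated
dt[, z := 1L]                # the new column drops into a spare slot
```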

There should not be a huge difference in the time to shallow copy unless you have very many columns. Here's a small benchmark on 5000 columns with 1e4 rows:

require(data.table) # 1.8.11
require(dplyr)      # latest commit from github

dt <- as.data.table(lapply(1:5e3, function(x) sample(1e4)))
ans1 <- sapply(1:1e2, function(x) {
    dd <- copy(dt) # so as to create the new column each time
    system.time(set(dd, i=NULL, j="V1001", value=1L))['elapsed'] 
    # or equivalently of dd[, V1001 := 1L]
})

df <- tbl_df(as.data.frame(dt))
ans2 <- sapply(1:1e2, function(x) {
    system.time(mutate(df, V1001 = 1L))['elapsed']
})
> summary(ans1) # data.table
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.00000 0.00100 0.00061 0.00100 0.00100
> summary(ans2) # dplyr
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.03800 0.03900 0.03900 0.04178 0.04100 0.07900

You can see the difference in the mean time here (0.00061 vs 0.04178).

B) Modify an existing column:

df2 <- mutate(df, y=1L)
changes(df, df2)
# Changed variables:
#           old         new
# y         0x105e5a850 0x105e590e0 

It tells you that y has been changed - that is, a copy of column y has been made. A new memory location had to be created to change the values of y, because it was previously pointing to the same location as df's y.

However, since data.table modifies in place there'll be no copy made in case of (B). It'll modify df in place. So you should see a performance difference if you are modifying columns.

This is one of the fundamental differences in philosophy between the two packages: dplyr avoids modifying in place, and therefore trades off by copying when modifying existing columns.

And because of this, it isn't possible to change the values of certain rows of a particular column of a data.frame without a deep copy. That is:

DT[x >= 5L, y := 1L] # y is an existing column

This can't be done without an entire copy of the data.frame in base R or dplyr, to my knowledge.
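A sketch confirming that data.table's sub-assign really does edit y in place (again using data.table's address() helper, an assumption about your data.table version):

```r
library(data.table)

DT <- data.table(x = 1:10, y = 1:10)
before <- address(DT$y)
DT[x >= 5L, y := 1L]       # update a subset of y by reference

identical(before, address(DT$y))  # TRUE: y was edited in place, not replaced
DT$y                              # 1 2 3 4 1 1 1 1 1 1
```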


Also, consider a 2-column dataset of size 20GB (two columns of 10GB each) on a machine with 32GB RAM. The data.table philosophy is to provide a way to change a subset of those 10GB columns by reference, without copying even a single column once. A copy of one column would need an extra 10GB and may fail with an out-of-memory error, let alone be fast. This concept (:=) is analogous to UPDATE in SQL.

11 votes

To understand what's going on, you need to understand what exactly is being copied. A data frame is actually quite cheap to copy because it's basically just an array of pointers to the columns. Doing a shallow copy of a data frame is very cheap, because you just have to copy those pointers.

However, most base R functions do a deep copy. So when you do:

df <- data.frame(x = 1:10, y = 1:10)
transform(df, z = x + y)

not only does R copy the data frame, it actually copies each individual column. dplyr provides the changes() function to make this easier to see. For each column in the data frame, it displays the memory location where that column lives. If it's changed, then the complete column has been copied:

df2 <- transform(df, z = x + y)
changes(df, df2)
#> Changed variables:
#>           old            new           
#> x         0x7fb19adcd378 0x7fb19ab9bcb8
#> y         0x7fb19adcd3d0 0x7fb19ab9bd10
#> z                        0x7fb19ab9bd68
#> 
#> Changed attributes:
#>           old            new           
#> names     0x7fb19adcce98 0x7fb1944e4558
#> row.names 0x7fb19ab2bd10 0x7fb19ab2bf20
#> class     0x7fb19ad5d208 0x7fb19ab51b28

If we do the same thing in dplyr, the original columns aren't copied:

df3 <- dplyr::mutate(df, z = x + y)
changes(df, df3)
#> Changed variables:
#>           old new           
#> z             0x7fb19adcd060
#> 
#> Changed attributes:
#>           old            new           
#> names     0x7fb19adcce98 0x7fb1944e8b18
#> row.names 0x7fb19ab9c0d8 0x7fb19ab9c340
#> class     0x7fb19ad5d208 0x7fb19ad69408

This makes dplyr much faster than base R.

Data.table is a little bit faster again, because it allows you to modify the data table in place - it doesn't even have to copy the pointers to the columns. I think not modifying in place makes dplyr a little easier to understand (because it preserves the usual R semantics), at the cost of being a little slower (but the cost grows with the number of columns, not the number of rows).
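The semantic difference is easiest to see inside a function; here is a small sketch contrasting R's usual copy-on-modify with data.table's set() (the function names f and g are illustrative only):

```r
library(data.table)

# Usual R semantics: modifying an argument doesn't affect the caller's object
df <- data.frame(x = 1:5)
f  <- function(d) { d$x <- 0L; invisible(NULL) }
f(df)
df$x  # still 1 2 3 4 5

# data.table semantics: set() modifies the caller's object in place
dt <- data.table(x = 1:5)
g  <- function(d) set(d, j = "x", value = 0L)
g(dt)
dt$x  # now 0 0 0 0 0
```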