Nice question. In general, I'd benchmark on a data size that's big enough that it doesn't fit (almost) entirely in the cache. Have a look here under "initial setup". It really isn't meaningful to compare tools developed for (in-memory) big data on tasks that run in milliseconds. We are planning to benchmark on relatively bigger data in the future.
Additionally, if your intent is to find out whether mutate is performing a copy, then all you have to do is check the address before and after (this can be done using .Internal(inspect(.)) in base R, or using the function changes() in dplyr).
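For example, here's a minimal sketch of the base R route using the same toy data as below (the printed addresses will of course differ on your machine):
require(dplyr)
df <- tbl_df(data.frame(x=1:5, y=6:10))
.Internal(inspect(df))  # note the addresses of the x and y vectors
df2 <- mutate(df, z=1L)
.Internal(inspect(df2)) # x and y should show the same addresses; only z is new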
On to whether a copy is being made or not:
There are two different things to be checked here. A) creating a new column, and B) modifying an existing column.
A) Creating a new column:
require(dplyr)
require(data.table)
df <- tbl_df(data.frame(x=1:5, y=6:10))
df2 <- mutate(df, z=1L)
changes(df, df2)
# Changed variables:
# old new
# z 0x105ec36d0
It tells you that there are no changes in the addresses of x and y, and points out z, which we just added. What's happening here? dplyr shallow copies the data.frame and then adds the new column. A shallow copy, as opposed to a deep copy, just copies the vector of column pointers, not the data itself. Therefore it should be fast. Basically, df2 is created with 3 columns, where the first two columns point to the same address locations as those of df, and the 3rd column was just created.
On the other hand, data.table doesn't have to shallow copy at all, as it modifies the column by reference (in place). data.table also (cleverly) over-allocates a list of column vectors, which allows fast adding of (new) columns by reference.
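As a rough illustration of the over-allocation (truelength() is a data.table helper; the exact number of spare slots depends on your version and the datatable.alloccol option):
require(data.table)
dt <- data.table(x=1:5, y=6:10)
length(dt)      # 2 columns actually in use
truelength(dt)  # larger than 2: spare column-pointer slots reserved up front
dt[, z := 1L]   # the new column slots into the spare space, by reference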
There should not be a huge difference in the time to shallow copy as long as you don't have too many columns. Here's a small benchmark on 5000 columns with 1e4 rows:
require(data.table) # 1.8.11
require(dplyr) # latest commit from github
dt <- as.data.table(lapply(1:5e3, function(x) sample(1e4)))
ans1 <- sapply(1:1e2, function(x) {
  dd <- copy(dt) # copy so that the new column is created afresh each time
  system.time(set(dd, i=NULL, j="V1001", value=1L))['elapsed']
  # or equivalently dd[, V1001 := 1L]
})

df <- tbl_df(as.data.frame(dt))
ans2 <- sapply(1:1e2, function(x) {
  system.time(mutate(df, V1001 = 1L))['elapsed']
})
> summary(ans1) # data.table
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00000 0.00100 0.00061 0.00100 0.00100
> summary(ans2) # dplyr
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.03800 0.03900 0.03900 0.04178 0.04100 0.07900
You can see the difference in the "mean time" here (0.00061 vs 0.04178).
B) Modify an existing column:
df2 <- mutate(df, y=1L)
changes(df, df2)
# Changed variables:
# old new
# y 0x105e5a850 0x105e590e0
It tells you that y has been changed: a copy of column y has been made. It had to create a new memory location to change the values of y, because it was pointing to the same location as df's y before.
However, since data.table modifies in place, there will be no copy made in case (B): it modifies the data.table itself rather than creating a new object. So you should see a performance difference if you are modifying existing columns.
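Here's a quick, hedged way to check that claim yourself (tracemem() is base R and needs a build with memory profiling; address() ships with newer data.table versions):
require(data.table)
dt <- data.table(x=1:5, y=6:10)
tracemem(dt)                # prints a message if dt is ever duplicated
addr_before <- address(dt)  # address of the data.table object
dt[, y := 1L]               # modify the existing column by reference
identical(addr_before, address(dt)) # TRUE, and no duplication message expected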
This is one of the fundamental differences in philosophy between the two packages: dplyr prefers not to modify in place, and therefore trades off by copying when modifying existing columns.
And because of this, it wouldn't be possible to change the values of certain rows of a particular column of a data.frame without a deep copy. That is:
DT[x >= 5L, y := 1L] # y is an existing column
To my knowledge, this can't be done in base R or dplyr without an entire copy of the data.frame.
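To make the contrast concrete, here's a hedged sketch of the same update written three ways (the names and data are just illustrative; the dplyr form uses ifelse(), which builds a brand-new y vector):
require(data.table)
require(dplyr)
DT <- data.table(x=1:10, y=0L)
DT[x >= 5L, y := 1L]                          # updates the matching rows of y in place
DF <- data.frame(x=1:10, y=0L)
DF$y[DF$x >= 5L] <- 1L                        # base R: typically copies under the hood
DF2 <- mutate(DF, y = ifelse(x >= 5L, 1L, y)) # dplyr: materialises a new y column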
Also, consider a 2-column dataset of size 20GB (two columns of 10GB each) on a machine with 32GB RAM. The data.table philosophy is to provide a way to change a subset of those 10GB columns by reference, without copying even a single column once. A copy of one column would need an extra 10GB and might fail with out-of-memory, let alone be fast. This concept (:=) is analogous to UPDATE in SQL.
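For readers coming from SQL, the rough correspondence (the SQL is shown only as a comment, with a hypothetical table name) is:
DT[x >= 5L, y := 1L]  # data.table: sub-assign by reference, no copy of y
# SQL analogue (hypothetical table DT):
#   UPDATE DT SET y = 1 WHERE x >= 5;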
Comments:
With the .Call C API one can potentially alter any object in place. The API does not force the developer to return a new object, and all the data of the object passed in is available to the developer via C pointers of the SEXP structure. I've done it myself for fast in-place manipulation of image data (and no, it is not advisable!). – Oleg Sklyar
dplyr was 50% slower than data.table in all iterations of your benchmark (100x on microbenchmark, though on an earlier run I got a worse result for the max case). This is on a Windows 64-bit machine. – BrodieG
dplyr version 0.1.1 (2014-02-09), and data.table 1.8.10. – Beasterfield
data.table: 10.52806 10.91406 11.51819 11.91552 14.73834, dplyr: 15.69537 16.29676 16.71768 17.43426 24.86194 (min, lq, med, uq, max; milliseconds). – BrodieG