By v1.9.2, `rbindlist` had evolved quite a bit, implementing many features including:
- Choosing the highest `SEXPTYPE` of columns while binding (implemented in v1.9.2, closing FR #2456 and Bug #4981).
- Handling `factor` columns properly (first implemented in v1.8.10, closing Bug #2650, and extended to binding ordered factors carefully in v1.9.2 as well, closing FR #4856 and Bug #5019).
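
As a rough illustration of the two points above, here is a minimal sketch using made-up toy tables (`a`, `b`, `f1`, `f2` are hypothetical names, not from the original benchmark). It assumes a reasonably recent data.table; the exact order of the combined factor levels is not asserted here.

```r
library(data.table)

# Integer column bound with a double column: the result takes the
# "higher" of the two types, so nothing is truncated.
a = data.table(x = 1:2)
b = data.table(x = c(2.5, 3.5))
ab = rbindlist(list(a, b))
class(ab$x)            # "numeric"

# Factor columns with different level sets: the values are preserved
# instead of the underlying integer codes being mixed up.
f1 = data.table(g = factor(c("a", "b")))
f2 = data.table(g = factor(c("c", "a")))
f12 = rbindlist(list(f1, f2))
as.character(f12$g)    # "a" "b" "c" "a"
levels(f12$g)          # contains the levels of both inputs
```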
In addition, in v1.9.2, `rbind.data.table` also gained a `fill` argument, which allows binding while filling in missing columns; it was implemented in R.
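
A small sketch of what `fill` does, using two hypothetical tables `d1` and `d2` where the second lacks column `b`:

```r
library(data.table)

d1 = data.table(a = 1:2, b = c("x", "y"))
d2 = data.table(a = 3L)              # no column 'b'

# rbind() dispatches to rbind.data.table; the missing column is
# filled with NA instead of raising an error.
out = rbind(d1, d2, fill = TRUE)
out$a   # 1 2 3
out$b   # "x" "y" NA
```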
Now in v1.9.3, there are even more improvements on these existing features:
- `rbindlist` gains an argument `use.names`, which is `FALSE` by default for backwards compatibility.
- `rbindlist` also gains an argument `fill`, which is also `FALSE` by default for backwards compatibility.
- These features are all implemented in C, and written carefully so as not to compromise speed while adding functionality.
- Since `rbindlist` can now match by names and fill missing columns, `rbind.data.table` just calls `rbindlist` now. The only difference is that `use.names=TRUE` by default for `rbind.data.table`, for backwards compatibility.
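
To make the two arguments concrete, here is a minimal sketch with hypothetical tables `p`, `q` and `r`. Both arguments are set explicitly rather than relying on defaults, since the defaults described above are those of v1.9.3 and may differ in later releases.

```r
library(data.table)

p = data.table(a = 1:2, b = 3:4)
q = data.table(b = 5:6, a = 7:8)     # same columns, different order

by_pos  = rbindlist(list(p, q), use.names = FALSE)  # bind by position
by_name = rbindlist(list(p, q), use.names = TRUE)   # bind by column name
by_pos$a    # 1 2 5 6  (q's first column, 'b', lands under 'a')
by_name$a   # 1 2 7 8

# fill = TRUE additionally pads columns missing from some inputs with NA.
r = data.table(a = 9L, c = "z")
filled = rbindlist(list(p, r), use.names = TRUE, fill = TRUE)
names(filled)   # "a" "b" "c"
```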
`rbind.data.frame` slows down quite a bit, mostly due to copies (which @mnel points out as well) that could be avoided (by moving to C). I think that's not the only reason, though. The implementation for checking/matching column names in `rbind.data.frame` could also get slower when there are many columns per data.frame and many such data.frames to bind (as shown in the benchmark below).
However, the fact that `rbindlist` lack(ed) certain features (like checking factor levels or matching names) bears very little (or no) weight towards it being faster than `rbind.data.frame`. That's because these features were carefully implemented in C, optimised for speed and memory.
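
For reference, the name matching that `rbind.data.frame` performs can be seen on a tiny example. This is only a sketch with hypothetical data frames `x` and `y`; it shows the behaviour, not its cost.

```r
# Base R only: rbind() on data.frames matches columns by name,
# so the differing column order in 'y' is handled.
x = data.frame(a = 1:2, b = c("u", "v"), stringsAsFactors = FALSE)
y = data.frame(b = "w", a = 3L, stringsAsFactors = FALSE)

xy = rbind(x, y)
xy$a   # 1 2 3
xy$b   # "u" "v" "w"
```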
Here's a benchmark that highlights efficient binding while also matching by column names, using `rbindlist`'s `use.names` feature from v1.9.3. The data set consists of 10000 data.frames, each of size 10*500.
NB: this benchmark has been updated to include a comparison to `dplyr`'s `bind_rows`.
```r
library(data.table) # 1.11.5, 2018-06-02 00:09:06 UTC
library(dplyr)      # 0.7.5.9000, 2018-06-12 01:41:40 UTC

set.seed(1L)
names = paste0("V", 1:500)
cols = 500L

# Build one 10 x 500 data.frame whose column names are randomly shuffled
foo <- function() {
    data = as.data.frame(setDT(lapply(1:cols, function(x) sample(10))))
    setnames(data, sample(names))
}

# List of 10,000 such data.frames
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    # plain assignment; avoids calling data.table's internal C routine directly
    ll[[i]] = foo()
}

# Bind by position (no name matching)
system.time(ans1 <- rbindlist(ll))
#    user  system elapsed
#   1.226   0.070   1.296

# Bind while matching column names
system.time(ans2 <- rbindlist(ll, use.names=TRUE))
#    user  system elapsed
#   2.635   0.129   2.772

# Base R
system.time(ans3 <- do.call("rbind", ll))
#    user  system elapsed
#  36.932   1.628  38.594

# dplyr
system.time(ans4 <- bind_rows(ll))
#    user  system elapsed
#  48.754   0.384  49.224

identical(ans2, setDT(ans3))
# [1] TRUE
identical(ans2, setDT(ans4))
# [1] TRUE
```
Binding columns as such, without checking names, took just 1.3 seconds, whereas checking column names and binding appropriately took just 1.5 seconds more. Compared to the base solution, this is 14x faster, and 18x faster than `dplyr`'s version.
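
The figures quoted above follow directly from the elapsed times in the benchmark output. A quick sketch of the arithmetic (the vector `elapsed` and its element names are just labels chosen here):

```r
# Elapsed times (seconds) copied from the benchmark output
elapsed = c(rbindlist = 1.296, rbindlist_names = 2.772,
            base_rbind = 38.594, bind_rows = 49.224)

# Extra cost of matching names in rbindlist
extra = elapsed[["rbindlist_names"]] - elapsed[["rbindlist"]]
round(extra, 1)    # 1.5

# Speed-up of rbindlist(use.names=TRUE) over base rbind and bind_rows
speedup = elapsed[c("base_rbind", "bind_rows")] / elapsed[["rbindlist_names"]]
round(speedup)     # 14 and 18
```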