I was working on this today for a data.frame (really a data.table) with millions of observations and 35 columns. My goal was to return a list of data.frames (data.tables) each with a single row. That is, I wanted to split each row into a separate data.frame and store these in a list.
Here are two methods I came up with that were roughly 3 times faster than split(dat, seq_len(nrow(dat)))
for that data set. Below, I benchmark the three methods on a 7500 row, 5 column data set (iris repeated 50 times).
library(data.table)
library(microbenchmark)
microbenchmark(
split={dat1 <- split(dat, seq_len(nrow(dat)))},
setDF={dat2 <- lapply(seq_len(nrow(dat)),
function(i) setDF(lapply(dat, "[", i)))},
attrDT={dat3 <- lapply(seq_len(nrow(dat)),
function(i) {
tmp <- lapply(dat, "[", i)
attr(tmp, "class") <- c("data.table", "data.frame")
setDF(tmp)
})},
datList = {datL <- lapply(seq_len(nrow(dat)),
function(i) lapply(dat, "[", i))},
times=20
)
This returns
Unit: milliseconds
expr min lq mean median uq max neval
split 861.8126 889.1849 973.5294 943.2288 1041.7206 1250.6150 20
setDF 459.0577 466.3432 511.2656 482.1943 500.6958 750.6635 20
attrDT 399.1999 409.6316 461.6454 422.5436 490.5620 717.6355 20
datList 192.1175 201.9896 241.4726 208.4535 246.4299 411.2097 20
While the differences are not as large as in my previous test, the straight setDF
method is significantly faster at all levels of the distribution of runs with max(setDF) < min(split) and the attr
method is typically more than twice as fast.
A fourth method is the extreme champion, which is a simple nested lapply
, returning a nested list. This method exemplifies the cost of constructing a data.frame from a list. Moreover, all methods I tried with the data.frame
function were roughly an order of magnitude slower than the data.table
techniques.
data
dat <- vector("list", 50)
for(i in 1:50) dat[[i]] <- iris
dat <- setDF(rbindlist(dat))