5 votes

I've got a list (length = 5000) of tibbles that I want to merge. They all have the same columns, so I thought of merging them with dplyr::bind_rows. On the face of it, binding rows for each added tibble is very quick; however, the execution time grows much faster than linearly as more tibbles are added. Having done some googling, it looks very much like the bug observed here: https://github.com/tidyverse/dplyr/issues/1396. Even though that bug is supposed to have been fixed in the bind_rows internals, I'm still seeing the elapsed time per added tibble increase non-linearly.

library(foreach)
library(tidyverse)
set.seed(123456)
tibbles <- foreach(i = 1:200) %do% {
              tibble(a = rnorm(10000), 
                     b = rep(letters[1:25], 400), 
                     c = rnorm(10000))
}
times <- foreach(i = 1:200) %do% {
            system.time(tibbles[1:i] %>% 
                            purrr::reduce(bind_rows))
}

times %>% 
     map_dbl(.f = ~.x[3]) %>% 
     plot(ylab = "time [s] per added tibble")

Any ideas why this is the case and how to solve it?

Thanks.

[Plot: time [s] per added tibble, increasing with each iteration]

Growing a data frame row by row is inefficient. data.table can serve you better here. - abhiieor
I may be misunderstanding, but can't you use bind_rows on the list directly instead of via purrr::reduce? Things look linear to me with that change. - aosmith
No, it was me who misunderstood - I didn't know you could use it on a list. Thanks for that. Got the job done in no time at all. - biomiha

2 Answers

3 votes

My guess is that every time you call bind_rows, R has to allocate a new set of columns and copy all the accumulated data over. Since the amount copied grows with each step, this leads to a quadratic increase in total time.
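A quick way to see that pattern is to time the one-at-a-time binding at two input sizes; this is a rough sketch (exact timings will vary by machine), using smaller tibbles than in the question so it runs fast:

```r
library(dplyr)
library(tibble)

set.seed(1)
tbls <- lapply(1:100, function(i) tibble(x = rnorm(1000)))

# Bind tibbles one at a time; each step copies the accumulated result
bind_one_by_one <- function(lst) {
  out <- lst[[1]]
  for (t in lst[-1]) out <- bind_rows(out, t)
  out
}

t50  <- system.time(bind_one_by_one(tbls[1:50]))["elapsed"]
t100 <- system.time(bind_one_by_one(tbls[1:100]))["elapsed"]
# With quadratic copying, doubling the number of tibbles tends to
# roughly quadruple the elapsed time rather than double it.
```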

Try pre-allocating the columns instead:

system.time({
n <- vapply(tibbles, nrow, 0)
ntot <- sum(n)

# Pre-allocate one full-length vector per column
cols <- list(a = numeric(ntot), b = character(ntot), c = numeric(ntot))

# Copy each tibble's columns into its slice of the pre-allocated vectors
off <- 0
for (i in seq_along(tibbles)) {
    ix <- off + seq_len(n[[i]])
    for (j in seq_along(cols)) {
        cols[[j]][ix] <- tibbles[[i]][[j]]
    }
    off <- off + n[[i]]
}

result <- as_tibble(cols)
})
#>    user  system elapsed 
#>   0.073   0.012   0.085

Compare with the purrr::reduce approach:

system.time(tibbles[1:200] %>% purrr::reduce(bind_rows))
#>   user  system elapsed 
#>  4.888   2.013   6.928 

Although, as aosmith notes, in your situation it's better to just call bind_rows on the whole list:

system.time(result <- bind_rows(tibbles))
#>  user  system elapsed 
#> 0.039   0.005   0.044 
0 votes

To expand on abhiieor's comment, rbindlist or rbind from data.table might be helpful. Assuming you are binding the rows of a series of tibbles (or data.tables), this code is nearly instant.

library(data.table)
library(foreach)

time <- proc.time()

data_tables <- foreach(i = 1:200) %do% {
  data.table(a = rnorm(10000), 
             b = rep(letters[1:25], 400), 
             c = rnorm(10000))
}

all_tables <- rbindlist(data_tables)

end_time <- proc.time() - time
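Note that rbindlist also accepts the original list of tibbles directly; the result is a data.table, which you can convert back with as_tibble if you need one. A minimal sketch (with small toy tibbles for illustration):

```r
library(data.table)
library(tibble)

tbs <- list(tibble(a = 1:3, b = letters[1:3]),
            tibble(a = 4:5, b = letters[4:5]))

dt <- rbindlist(tbs)   # binds the whole list in one call
tb <- as_tibble(dt)    # back to a tibble if you prefer
```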