What is the most efficient way to cast a list as a data frame?

47

votes

Very often I want to convert a list wherein each index has identical element types to a data frame. For example, I may have a list:

> my.list
[[1]]
[[1]]$global_stdev_ppb
[1] 24267673

[[1]]$range
[1] 0.03114799

[[1]]$tok
[1] "hello"

[[1]]$global_freq_ppb
[1] 211592.6


[[2]]
[[2]]$global_stdev_ppb
[1] 11561448

[[2]]$range
[1] 0.08870838

[[2]]$tok
[1] "world"

[[2]]$global_freq_ppb
[1] 1002043

I want to convert this list to a data frame where each index element is a column. The natural (to me) thing to go is to is use do.call:

> my.matrix<-do.call("rbind", my.list)
> my.matrix
     global_stdev_ppb range      tok     global_freq_ppb
[1,] 24267673         0.03114799 "hello" 211592.6       
[2,] 11561448         0.08870838 "world" 1002043

Straightforward enough, but when I attempt to cast this matrix as a data frame, the columns remain list elements, rather than vectors:

> my.df<-as.data.frame(my.matrix, stringsAsFactors=FALSE)
> my.df[,1]
[[1]]
[1] 24267673

[[2]]
[1] 11561448

Currently, to get the data frame cast properly I am iterating over each column using unlist and as.vector, then recasting the data frame as such:

new.list<-lapply(1:ncol(my.matrix), function(x) as.vector(unlist(my.matrix[,x])))
my.df<-as.data.frame(do.call(cbind, new.list), stringsAsFactors=FALSE)

This, however, seem very inefficient. Is there are better way to do this?

listr dataframe

see ?data.table::rbindlist - marbel

As of 2017 you should use your_list %>% reduce(bind_rows) from purrr - Zafar

50

votes

I think you want:

> do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE))
  global_stdev_ppb      range   tok global_freq_ppb
1         24267673 0.03114799 hello        211592.6
2         11561448 0.08870838 world       1002043.0
> str(do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE)))
'data.frame':   2 obs. of  4 variables:
 $ global_stdev_ppb: num  24267673 11561448
 $ range           : num  0.0311 0.0887
 $ tok             : chr  "hello" "world"
 $ global_freq_ppb : num  211593 1002043

31

votes

Another option is:

data.frame(t(sapply(mylist, `[`)))

but this simple manipulation results in a data frame of lists:

> str(data.frame(t(sapply(mylist, `[`))))
'data.frame':   2 obs. of  3 variables:
 $ a:List of 2
  ..$ : num 1
  ..$ : num 2
 $ b:List of 2
  ..$ : num 2
  ..$ : num 3
 $ c:List of 2
  ..$ : chr "a"
  ..$ : chr "b"

An alternative to this, along the same lines but now the result same as the other solutions, is:

data.frame(lapply(data.frame(t(sapply(mylist, `[`))), unlist))

[Edit: included timings of @Martin Morgan's two solutions, which have the edge over the other solution that return a data frame of vectors.] Some representative timings on a very simple problem:

mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

> ## @Joshua Ulrich's solution:
> system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame,
+                                     stringsAsFactors=FALSE))))
   user  system elapsed 
  1.740   0.001   1.750

> ## @JD Long's solution:
> system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame))))
   user  system elapsed 
  2.308   0.002   2.339

> ## my sapply solution No.1:
> system.time(replicate(1000, data.frame(t(sapply(mylist, `[`)))))
   user  system elapsed 
  0.296   0.000   0.301

> ## my sapply solution No.2:
> system.time(replicate(1000, data.frame(lapply(data.frame(t(sapply(mylist, `[`))), 
+                                               unlist))))
   user  system elapsed 
  1.067   0.001   1.091

> ## @Martin Morgan's Map() sapply() solution:
> f = function(x) function(i) sapply(x, `[[`, i)
> system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
   user  system elapsed 
  0.775   0.000   0.778

> ## @Martin Morgan's Map() lapply() unlist() solution:
> f = function(x) function(i) unlist(lapply(x, `[[`, i), use.names=FALSE)
> system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
   user  system elapsed 
  0.653   0.000   0.658

18

votes

I can't tell you this is the "most efficient" in terms of memory or speed, but it's pretty efficient in terms of coding:

my.df <- do.call("rbind", lapply(my.list, data.frame))

the lapply() step with data.frame() turns each list item into a single row data frame which then acts nice with rbind()

16

votes

Although this question has long since been answered, it's worth pointing out the data.table package has rbindlist which accomplishes this task very quickly:

library(microbenchmark)
library(data.table)
l <- replicate(1E4, list(a=runif(1), b=runif(1), c=runif(1)), simplify=FALSE)

microbenchmark( times=5,
  R=as.data.frame(Map(f(l), names(l[[1]]))),
  dt=data.frame(rbindlist(l))
)

gives me

Unit: milliseconds
 expr       min        lq    median        uq       max neval
    R 31.060119 31.403943 32.278537 32.370004 33.932700     5
   dt  2.271059  2.273157  2.600976  2.635001  2.729421     5

13

votes

This

f = function(x) function(i) sapply(x, `[[`, i)

is a function that returns a function that extracts the i'th element of x. So

Map(f(mylist), names(mylist[[1]]))

gets a named (thanks Map!) list of vectors that can be made into a data frame

as.data.frame(Map(f(mylist), names(mylist[[1]])))

For speed it's usually faster to use unlist(lapply(...), use.names=FALSE) as

f = function(x) function(i) unlist(lapply(x, `[[`, i), use.names=FALSE)

A more general variant is

f = function(X, FUN) function(...) sapply(X, FUN, ...)

When do the list-of-lists structures come up? Maybe there's an earlier step where an iteration could be replaced by something more vectorized?

3

votes

The dplyr package's bind_rows is efficient.

one <- mtcars[1:4, ]
two <- mtcars[11:14, ]
system.time(dplyr::bind_rows(one, two))
   user  system elapsed 
  0.001   0.000   0.001

0

votes

Not sure where they rank as far as efficiency, but depending on the structure of your lists there are some tidyverse options. A bonus is that they work nicely with unequal length lists:

l <- list(a = list(var.1 = 1, var.2 = 2, var.3 = 3)
        , b = list(var.1 = 4, var.2 = 5)
        , c = list(var.1 = 7, var.3 = 9)
        , d = list(var.1 = 10, var.2 = 11, var.3 = NA))

df <- dplyr::bind_rows(l)
df <- purrr::map_df(l, dplyr::bind_rows)
df <- purrr::map_df(l, ~.x)

# all create the same data frame:
# A tibble: 4 x 3
  var.1 var.2 var.3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5    NA
3     7    NA     9
4    10    11    NA

And you can also mix vectors and data frames:

library(dplyr)
bind_rows(
  list(a = 1, b = 2),
  data_frame(a = 3:4, b = 5:6),
  c(a = 7)
)

# A tibble: 4 x 2
      a     b
  <dbl> <dbl>
1     1     2
2     3     5
3     4     6
4     7    NA

What is the most efficient way to cast a list as a data frame?

7 Answers