1
votes

Representative sample data (list of lists):

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr", 
    score = -0.21104594634643), .Names = c("id", "label", 
"link", "score")), e = 49.1279871269422), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.934821052832427, 
b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", 
    link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Scoresbysund", 
    score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", 
c = "P", d = list(structure(list(id = 8L, label = "Georgia", 
    link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 2L, label = "Washington", link = "America/Shiprock", 
    score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 6L, label = "North Dakota", link = "Universal", 
    score = 1.03168296038975), .Names = c("id", "label", 
"link", "score")), structure(list(id = 1L, label = "New Hampshire", 
    link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", 
"label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", 
    link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", 
"label", "link", "score"))), e = 132.1153538536), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = 0.202685974077313, b = "x", 
c = "O", d = structure(list(id = 3L, label = "Delaware", 
    link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", 
"label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.396243444741009, 
b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", 
    link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Ojinaga", 
    score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", 
"b", "c", "d", "e")))

I have a list of lists, by virtue of a JSON data download.

The list has 176 elements, each with 33 nested elements, some of which are themselves lists of varying length.

I am interested in analyzing the data contained in one particular nested list. For each of the 176 elements it has a length of ~150, and each of its entries has either 4 or 5 elements. I am trying to extract this nested list and convert it into a data.frame so that I can perform some analysis.

In the representative sample data above, I am interested in the nested list d for each of the 5 elements of l. The desired data.frame would therefore look something like:

id           label            link       score  externalId
 5            Utah     Asia/Anadyr  -0.2110459          NA
 8  South Carolina  Pacific/Wallis   0.5265409   -6.743544
 .
 .

I've been attempting to use purrr, which appears to offer a sensible and consistent flow for processing data in lists, but I am running into errors whose cause I can't fully understand. It could very well be that I don't properly understand the commands/logic of purrr or of lists (likely both). This is the code I've been attempting, along with the error it throws:

df <- map_df(l, "d", ~as.data.frame(.))
Error: incompatible sizes (5 != 4)

I believe this has to do with the differing lengths of d for each component, or perhaps the differing contained data (sometimes 4 elements sometimes 5) or perhaps the function I've used here is misspecified -- truthfully I'm not entirely sure.

I have worked around this by using a for loop, which I know is inefficient, hence my question here on SO.

This is the for loop I currently employ:

df <- data.frame(id = integer(), label = character(), score = numeric(), externalId = numeric())
for(i in seq_along(l)){
    df_temp <- l[[i]][[4]] %>% map_df(~as.data.frame(.))
    df <- rbind(df, df_temp)
}

Some assistance, preferably with purrr (or alternatively some version of apply, as this is still superior to my for loop), would be greatly appreciated. Also, if there's a resource covering the above, I'd like to understand it rather than just find the right code.


2 Answers

8
votes

You can do this in three steps: first pull out d, then bind the rows within each element of d, and finally bind everything into a single object.

I use bind_rows from dplyr for the within-list row binding. map_df does the final row binding.

library(purrr)
library(dplyr)

l %>%
    map("d") %>%
    map_df(bind_rows)

This is also equivalent:

map_df(l, ~bind_rows(.x[["d"]] ) )

The result looks like:

# A tibble: 12 x 5
      id          label                 link      score externalId
   <int>          <chr>                <chr>      <dbl>      <dbl>
 1     5           Utah          Asia/Anadyr -0.2110459         NA
 2     8 South Carolina       Pacific/Wallis  0.5265409  -6.743544
 3     9       Nebraska America/Scoresbysund  0.2508955  16.425747
 4     8        Georgia         America/Nome  0.5264941   7.915836
 5     2     Washington     America/Shiprock -0.5551864  15.068666
 6     6   North Dakota            Universal  1.0316830         NA
 7     1  New Hampshire      America/Cordoba  1.2158206   9.727642
 8     1         Alaska        Asia/Istanbul -0.2318326         NA
 9     4   Pennsylvania Africa/Dar_es_Salaam  0.5902453         NA
10     3       Delaware       Asia/Samarkand  0.6955771  15.236482
11     4   North Dakota      America/Tortola  1.0306027  -7.216669
12     9       Nebraska      America/Ojinaga -1.1139800  -8.451451
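
If you also need to know which element of l each row came from, map_df accepts an .id argument that adds an identifier column. The following is only a sketch: sample_l is a hypothetical, trimmed stand-in for l, and the column name "element" is an arbitrary choice.

```r
library(purrr)
library(dplyr)

# Hypothetical, trimmed stand-in for `l`: the first `d` is a single record,
# the second `d` is a list of two records.
sample_l <- list(
  list(d = list(id = 5L, label = "Utah", score = -0.21)),
  list(d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# .id adds a column recording which element of sample_l produced each row
df <- map_df(sample_l, ~ bind_rows(.x[["d"]]), .id = "element")
df
```

Because sample_l is unnamed, the identifier is based on position in the list; with a named list the names would be used instead.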
0
votes

For more information on purrr, I recommend Wickham and Grolemund's "R for Data Science": http://r4ds.had.co.nz/

I think one issue you are facing is that the d element takes two different shapes across the items of l: in some it is a single observation (a named list of variables, ready to be converted to a data frame), while in others it is a list of such observations.
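
A quick way to check this, sketched on a hypothetical trimmed stand-in for l (sample_l is not part of the original data), is to test whether each d carries names of its own:

```r
# Hypothetical, trimmed stand-in for `l`
sample_l <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# A single observation is a named list; a list of observations is unnamed
is_single_record <- vapply(
  sample_l,
  function(x) !is.null(names(x$d)),
  logical(1)
)
is_single_record  # TRUE for the first element, FALSE for the second
```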

But I'm not that good at purrr myself. Here's how I would do it:

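library(dplyr) ## provides bind_rows() and %>%

l <- lapply(l, function(x){x$d}) ## work with the data you need.

## single observations: named lists, ready for as.data.frame
list_of_observations <- Filter(function(x) {!is.null(names(x))},l)

## unnamed lists that each hold several observations
list_of_lists <- Filter(function(x) {is.null(names(x))}, l)

## flatten one level so that every element is a single observation
another_list_of_observations <- unlist(list_of_lists, recursive=FALSE)

## keep strings as character (matters on R < 4.0), then stack the rows
df <- lapply(c(list_of_observations, another_list_of_observations),
             as.data.frame, stringsAsFactors = FALSE) %>% bind_rows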
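
Note that the code above places the single observations first, so the row order differs from the order in l. If the order matters, or if you want to avoid dplyr entirely, the following base R variant is one possible sketch. It assumes every observation is a flat named list of length-one values; sample_l is a hypothetical, trimmed stand-in for l.

```r
# Hypothetical, trimmed stand-in for `l`
sample_l <- list(
  list(d = list(id = 5L, label = "Utah", score = -0.21)),
  list(d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# Wrap single observations in a list so every `d` is a list of observations,
# then flatten one level; the original order is preserved
records <- unlist(
  lapply(sample_l, function(x) if (is.null(names(x$d))) x$d else list(x$d)),
  recursive = FALSE
)

# Union of all field names, in order of first appearance
cols <- unique(unlist(lapply(records, names)))

# Fill the fields an observation lacks with NA, then build one-row data frames
rows <- lapply(records, function(r) {
  r[setdiff(cols, names(r))] <- NA
  as.data.frame(r[cols], stringsAsFactors = FALSE)
})

df <- do.call(rbind, rows)
df
```

On the real data, replace sample_l with l. Growing the result with a single do.call(rbind, ...) also avoids the repeated rbind inside a for loop.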