2
votes

This is the third question of a similar theme (extracting subset from list of lists into data.frame) -- I am understanding more as I go on but still end up hitting obstacles when the problem changes slightly.

Previous two related questions: Extracting data from hierarchical lists of different lengths into `data.frame` using `purr` Extracting data from a list of lists into its own `data.frame` with `purrr`

Here's the third of a similar flavour -

sample data (representative list of lists):

q <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr", 
    score = -0.21104594634643), .Names = c("id", "label", "link", "score")), sentiment = list(structure(list(text = structure(list(content = "the normal flow of supply chain activities is interrupted,", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0.300000011920929, score = -0.300000011920929), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")), structure(list(text = structure(list(content = "companies may experience financial loss, cost increases,", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0, score = 0), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")), structure(list(text = structure(list(content = "market share declines, customer defection and damage to", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0.300000011920929, score = -0.300000011920929), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")))), .Names = c("a", "b", "c", "d", "sentiment")), structure(list(a = 7.74576236632992, b = "z", c = "F", d = structure(list(id = 3L, label = "South Carolina", link = "Pacific/Wallis", score = 2.44729194863711), .Names = c("id", "label", "link", "score")), sentiment = list(structure(list(text = structure(list(content = "impacted companies by seven percent, on average.", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0.300000011920929, score = -0.300000011920929), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")), structure(list(text = structure(list(content = "today’s shortened product lifecycles, more demanding", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0, score = 0), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")), structure(list(text = structure(list(content = "into global markets, mean this approach is no longer", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0.300000011920929, score = -0.300000011920929), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")), structure(list(
    text = structure(list(content = "and down rapidly as market conditions change.", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0, score = 0), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")), structure(list(text = structure(list(content = "flexible supply chain allows them to both reduce risk and", beginOffset = -1), .Names = c("content", "beginOffset")), sentiment = structure(list(magnitude = 0.5, score = 0.5), .Names = c("magnitude", "score"))), .Names = c("text", "sentiment")))), .Names = c("a", "b", "c", "d", "sentiment")))

I have a large list of lists, courtesy of a JSON extraction. There are various sublists of interest that I am trying to extract into their own tables (data.frame or data.table)

> q %>% map(names)
[[1]]
[1] "a"         "b"         "c"         "d"         "sentiment"
[[2]]
[1] "a"         "b"         "c"         "d"         "sentiment"

In this case I want:
- the 5th element ("sentiment") for each (q[[1]][[5]], q[[2]][[5]], etc)
- and some identification variables ("a", "b") say from the 1st element (q[[1]][[1]], q[[1]][[2]], etc)

The length of the 5th element varies but is always >1 whereas the length of the ID variables (i.e., a, b) is always one.

I've learnt from the previous two questions that these types of tasks are best approached by starting with the most nested elements and then working 'outward' employing the recycling element where necessary (for example with data.frame). The issue I'm having is with organizing the contents within the 5th element into the desired format, here is what I am doing:

> DF <- q %>% 
        map(`[`, c("a", "b", "sentiment")) %>% 
        map(modify_at, "sentiment", bind_rows) %>% 
        map_df(data.frame, stringsAsFactors = F)

When I first bind_rows of the "sentiment" sublist, for each element I get two rows of two variables rolled up, rather than one row of four variables:

head(DF, 2)
   a        b                                             sentiment.text sentiment.sentiment
1 -1.546765 s the normal flow of supply chain activities is interrupted,                 0.3 
2 -1.546765 s                                                         -1                -0.3

I understand this is by virtue of the structure of "sentiment", but I am unsure how to get one level deeper into the "text" and "sentiment" objects which each have two elements "content", "beginOffset" and "magnitude", "score" respectively.

The desired output instead of what is shown with head(DF, 2) would be:

a           b                                     sentiment.text.content sentiment.text.beginOffset sentiment.sentiment.magnitude sentiment.sentiment.score
1 -1.546765 s the normal flow of supply chain activities is interrupted,                         -1                           0.3                      -0.3
1

1 Answers

1
votes

Something like this ?

DF <- q %>% 
  map(`[`, c("a", "b", "sentiment")) %>% 
  map(.%>% modify_at("sentiment",. %>% map(as.data.frame,stringsAsFactors=FALSE) %>%bind_rows)) %>% 
  map_df(data.frame, stringsAsFactors = F)

#           a b                                     sentiment.text.content sentiment.text.beginOffset sentiment.sentiment.magnitude sentiment.sentiment.score
# 1 -1.546765 s the normal flow of supply chain activities is interrupted,                         -1                           0.3                      -0.3
# 2 -1.546765 s   companies may experience financial loss, cost increases,                         -1                           0.0                       0.0
# 3 -1.546765 s    market share declines, customer defection and damage to                         -1                           0.3                      -0.3
# 4  7.745762 z           impacted companies by seven percent, on average.                         -1                           0.3                      -0.3
# 5  7.745762 z       today’s shortened product lifecycles, more demanding                         -1                           0.0                       0.0
# 6  7.745762 z       into global markets, mean this approach is no longer                         -1                           0.3                      -0.3
# 7  7.745762 z              and down rapidly as market conditions change.                         -1                           0.0                       0.0
# 8  7.745762 z  flexible supply chain allows them to both reduce risk and                         -1                           0.5                       0.5

str(DF)
# 'data.frame': 8 obs. of  6 variables:
# $ a                            : num  -1.55 -1.55 -1.55 7.75 7.75 ...
# $ b                            : chr  "s" "s" "s" "z" ...
# $ sentiment.text.content       : chr  "the normal flow of supply chain activities is interrupted," "companies may experience financial loss, cost increases," "market share declines, customer defection and damage to" "impacted companies by seven percent, on average." ...
# $ sentiment.text.beginOffset   : num  -1 -1 -1 -1 -1 -1 -1 -1
# $ sentiment.sentiment.magnitude: num  0.3 0 0.3 0.3 0 ...
# $ sentiment.sentiment.score    : num  -0.3 0 -0.3 -0.3 0 ...