2
votes

Background

I've gotten myself into a situation where one column in a tibble/dataframe consists of a list of integer matrices which have zero or more rows and exactly 2 columns. This column happens to be the output of a stringr::str_locate_all() invocation, so I expect this is a common scenario.

What I would like to do is to select only one of the columns of the integer matrices and then unnest the dataframe, but I am getting confused about how to do this properly.

Example

Here's an example (I have to create it manually because dpasta() doesn't seem to work with list-column tibbles). In any case, my starting point is the tibble mydf:

library(tidyverse)

m1 <- matrix( c(761,784),             nrow=1,ncol=2, dimnames = list(c(),c("start","end")) )
m2 <- matrix( integer(0),             nrow=0,ncol=2, dimnames = list(c(),c("start","end")) )
m3 <- matrix( c(1001,2300,1010,2310), nrow=2,ncol=2, dimnames = list(c(),c("start","end")) )

mydf <- tibble( item = c("a","b","c"), pos = list(m1,m2,m3))

Below is what that looks like in the RStudio viewer. It's kind of misleading because it suggests that the pos entries are just vectors of integers. They're actually n x 2 matrices, and nothing in the viewer indicates that they're more complex. It caused me some confusion, but that's beside the point now.

[Image: starting point]
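Since the viewer hides the matrix structure, one way to check the real shape of each pos entry is to ask for its dimensions. This is only a minimal sketch on the same example data; pos_dims is a name introduced here for illustration.

```r
library(tibble)
library(purrr)

# Same example matrices as above: zero or more rows, exactly 2 columns
m1 <- matrix(c(761, 784),               nrow = 1, ncol = 2, dimnames = list(NULL, c("start", "end")))
m2 <- matrix(integer(0),                nrow = 0, ncol = 2, dimnames = list(NULL, c("start", "end")))
m3 <- matrix(c(1001, 2300, 1010, 2310), nrow = 2, ncol = 2, dimnames = list(NULL, c("start", "end")))

mydf <- tibble(item = c("a", "b", "c"), pos = list(m1, m2, m3))

# dim() returns c(rows, columns) for each matrix in the list column
pos_dims <- map(mydf$pos, dim)
str(pos_dims)
```

Each element comes back as a length-2 integer vector, which confirms the entries are matrices rather than plain vectors.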

What I would like to do is end up with an unnested tibble where the 1st column, "start", is selected. The desired output would look like this (after unnesting):

mydf_desired <- tibble( item = c("a","c","c"), start_pos = c(761,1001,2300))

[Image: desired outcome]

Note that the first row in mydf had only one row in its pos matrix, so it has one row in the desired result. The row with item="b" had a 0x2 matrix, so it doesn't appear (but it would have been OK if it appeared as an NA too). The row with item="c" had two rows in the pos matrix, so it has two rows in the desired result.

What I tried

This seems simple enough, I've unnested list columns before. The only twist here is that I have to first select the "start" column and then unnest, right? I just map the pos list column to [,1] to pick off the 1st column (the "start" column). And then it should be a matter of unnesting...

mydf_desired <- mydf %>% 
                mutate(start_pos = map(pos, ~ .[,1])) %>% 
                unnest()
#> Error in vec_rbind(!!!x, .ptype = ptype): Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
#> Warning: `cols` is now required.
#> Please use `cols = c(pos, start_pos)`

No idea what "value should have been recycled to fit x" actually means, but it's also giving me a warning about not supplying cols to unnest(). So my suspicion now is that the problem lies in what I am passing to unnest().

If I omit unnest() I don't get that error...

mydf_desired <- mydf %>% 
                mutate(start_pos = map(pos, ~ .[,1]))

And the output looks like this...

[Image: before unnest]

That sort of looks OK, I notice there's still a pos entry for item=b of integer(0). But even if I omit that row, I get the same error when I try to unnest().

Here's where I am stumped. Why can't I just unnest() this tibble? What is the meaning of the value should have been recycled to fit x error?

2 Answers

3
votes

One option is to filter out the rows with empty matrices, then map over the list column to extract the first column of each matrix, and finally use unnest_longer():

library(dplyr)
library(purrr)
library(tidyr)  # unnest_longer() and unnest() live in tidyr
mydf %>% 
   filter(lengths(pos) > 0) %>%
   transmute(item, start_pos = map(pos, ~ as.vector(.x[,1]))) %>% 
   unnest_longer(c(start_pos))
# A tibble: 3 x 2
#  item  start_pos
#  <chr>     <dbl>
#1 a           761
#2 c          1001
#3 c          2300
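The as.vector() call is probably there for the single-row case: when a one-row matrix with column names is indexed with [, 1], base R keeps the column name on the resulting length-one vector, and unnest_longer() may treat such inner names as indices. A small sketch of that base R behaviour, using only the example matrices (one_row and two_rows are illustrative names):

```r
m1 <- matrix(c(761, 784),               nrow = 1, ncol = 2, dimnames = list(NULL, c("start", "end")))
m3 <- matrix(c(1001, 2300, 1010, 2310), nrow = 2, ncol = 2, dimnames = list(NULL, c("start", "end")))

one_row  <- m1[, 1]  # length-one result keeps the column name "start"
two_rows <- m3[, 1]  # longer result takes (absent) row names, so it is unnamed

names(one_row)
names(two_rows)

# as.vector() strips the name, so every element of the list column is a bare vector
names(as.vector(one_row))
```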

We can also avoid the filter step by wrapping each extracted column in a tibble; the empty tibble for item "b" then simply contributes no rows:

mydf %>%
   transmute(item, pos = map(pos, ~ .x[,1] %>%
                          tibble(start_pos = .))) %>%
   unnest(c(pos))
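A related variant, offered only as a sketch and not as part of the approach above, converts each whole matrix to a tibble with as_tibble(), unnests, and then keeps just the start column. The name res is introduced here for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

m1 <- matrix(c(761, 784),               nrow = 1, ncol = 2, dimnames = list(NULL, c("start", "end")))
m2 <- matrix(integer(0),                nrow = 0, ncol = 2, dimnames = list(NULL, c("start", "end")))
m3 <- matrix(c(1001, 2300, 1010, 2310), nrow = 2, ncol = 2, dimnames = list(NULL, c("start", "end")))

mydf <- tibble(item = c("a", "b", "c"), pos = list(m1, m2, m3))

res <- mydf %>%
  mutate(pos = map(pos, as_tibble)) %>%  # each matrix becomes a tibble with start/end columns
  unnest(pos) %>%                        # zero-row tibbles (item "b") contribute no rows
  select(item, start_pos = start)        # keep only the start column, renamed

res
```

This keeps the end positions available until the final select(), which can be handy if both columns are needed later.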
1
votes

The error occurs because unnest(), called without arguments, tries to unnest every list column, including pos. Specify the column you want to unnest explicitly to avoid the error.

library(dplyr)
library(purrr)
library(tidyr)  # unnest() and unnest_longer() live in tidyr

mydf %>% mutate(start_pos = map(pos, ~.[, 1])) %>% unnest(start_pos)

# A tibble: 3 x 3
#  item  pos               start_pos
#  <chr> <list>                <dbl>
#1 a     <dbl[,2] [1 × 2]>       761
#2 c     <dbl[,2] [2 × 2]>      1001
#3 c     <dbl[,2] [2 × 2]>      2300

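If you want an NA row for item "b", you can use unnest_longer(). (Newer tidyr releases may drop empty elements by default here, in which case you would also need to pass keep_empty = TRUE.)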

mydf %>% 
   mutate(start_pos = map(pos, ~.[, 1])) %>% 
   unnest_longer(start_pos, indices_include = FALSE)

# A tibble: 4 x 3
#  item  pos               start_pos
#  <chr> <list>                <dbl>
#1 a     <dbl[,2] [1 × 2]>       761
#2 b     <int[,2] [0 × 2]>        NA
#3 c     <dbl[,2] [2 × 2]>      1001
#4 c     <dbl[,2] [2 × 2]>      2300

Or use unnest() with keep_empty = TRUE, which also keeps item "b" as an NA row:

mydf %>%
  mutate(start_pos = map(pos, ~.[, 1])) %>%
  unnest(start_pos, keep_empty = TRUE)
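To get only the two columns from the question while still keeping item "b", the same idea can be combined with dropping pos before unnesting. This is a sketch under the assumption that an NA row for "b" is acceptable; res is an illustrative name, and as.vector() is used to strip the name that a one-row matrix leaves on its extracted column.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

m1 <- matrix(c(761, 784),               nrow = 1, ncol = 2, dimnames = list(NULL, c("start", "end")))
m2 <- matrix(integer(0),                nrow = 0, ncol = 2, dimnames = list(NULL, c("start", "end")))
m3 <- matrix(c(1001, 2300, 1010, 2310), nrow = 2, ncol = 2, dimnames = list(NULL, c("start", "end")))

mydf <- tibble(item = c("a", "b", "c"), pos = list(m1, m2, m3))

res <- mydf %>%
  mutate(start_pos = map(pos, ~ as.vector(.x[, 1]))) %>%  # first column as a bare vector
  select(-pos) %>%                                        # drop the matrix column
  unnest(start_pos, keep_empty = TRUE)                    # empty vectors become NA rows

res
```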