0
votes

I have a very simple question about referencing data columns within a nested dataframe.

For a reproducible example, I'll nest mtcars by the two values of variable am:

library(tidyverse)
mtcars_nested <- mtcars %>% 
  group_by(am) %>% 
  nest()
mtcars_nested

which gives data that looks like this.

#> # A tibble: 2 x 2
#> # Groups:   am [2]
#>      am data              
#>   <dbl> <list>            
#> 1     1 <tibble [13 × 10]>
#> 2     0 <tibble [19 × 10]>

If I now want to use purrr::map to take the mean of mpg for each level of am, I wonder why this doesn't work:


take_mean_mpg <- function(df){
  mean(df[["data"]]$mpg)
}

map(mtcars_nested, take_mean_mpg)
Error in df[["data"]] : subscript out of bounds

Or maybe a simpler question is: how should I properly reference the mpg column once it's nested? I know that this doesn't work:

mtcars_nested[["data"]]$mpg

2 Answers

2
votes

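Data frames (and tibbles) are lists of columns, not lists of rows, so when you pass the whole tibble mtcars_nested to map(), it iterates over the columns, not over the rows. The first column it reaches is am, a plain numeric vector, and df[["data"]] on that vector is what raises the "subscript out of bounds" error. You can use mutate() with your function, and map_dbl() so that your new column is not a list column.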
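To make the column-wise iteration concrete, here is a minimal sketch. It loads dplyr, tidyr and purrr individually instead of the whole tidyverse (my choice, the behaviour should be the same), and the names visited, am_column and msg are only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

# map() treats the tibble as a list of columns, so it visits `am` and
# then `data`, never the individual rows.
visited <- map(mtcars_nested, class)
names(visited)

# The original function therefore first receives the numeric `am` column.
# Indexing that vector by the name "data" fails, which is the error
# reported in the question.
am_column <- mtcars_nested[["am"]]
msg <- tryCatch(am_column[["data"]], error = function(e) conditionMessage(e))
msg
```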

library(tidyverse)
mtcars_nested <- mtcars %>% 
  group_by(am) %>% 
  nest()
mtcars_nested

take_mean_mpg <- function(df){
  mean(df$mpg)
}

mtcars_nested %>%
  mutate(mean_mpg = map_dbl(.data[["data"]], take_mean_mpg))

The .data[["data"]] argument to map_dbl() gives it the data list column from your data frame to iterate over, rather than the entire data frame. The .data part of the argument has no relation to your column named "data"; it is the rlang pronoun .data, which refers to the data frame that mutate() is working on. [["data"]] then retrieves the column named "data" from it. You use mutate() because you are trying (I assumed, perhaps incorrectly) to add a column with the averages to the nested data frame. mutate() is used to add columns, so you add a column equal to the output of map() (or map_dbl()) applied with your function, which returns the list (or vector) of averages.
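If the pronoun looks confusing next to a column that is itself called data, a bare column name should work just as well inside mutate(). The sketch below compares the two spellings; with_pronoun and with_bare are illustrative names, and the packages are loaded individually as an assumption of mine.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

take_mean_mpg <- function(df) {
  mean(df$mpg)
}

# Version from the answer: the .data pronoun plus the column name as a string.
with_pronoun <- mtcars_nested %>%
  mutate(mean_mpg = map_dbl(.data[["data"]], take_mean_mpg))

# Same idea with the bare column name; inside mutate() `data` refers to
# the list column of the tibble.
with_bare <- mtcars_nested %>%
  mutate(mean_mpg = map_dbl(data, take_mean_mpg))

# Both spellings should give the same averages.
identical(with_pronoun$mean_mpg, with_bare$mean_mpg)
```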

This can be a confusing concept. Although map() is often used to iterate over the rows of a data frame, it technically iterates over a list (see the documentation, which says under Arguments):

.x A list or atomic vector.

It also returns a list or a vector. The good news is that columns are just lists of values, so you pass it the list (column) you want it to iterate over and assign it to the list (column) where you want it stored (this assignment happens with mutate()).
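As for referencing mpg directly: the list column holds one tibble per group, so you have to pick an element with [[ ]] before $mpg means anything. A small sketch of doing by hand what map() does for every element (am1_row, am1_tibble and manual_mean are illustrative names):

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

# Find the row for am == 1 instead of assuming it comes first.
am1_row <- which(mtcars_nested$am == 1)

# [[ ]] pulls a single tibble out of the list column ...
am1_tibble <- mtcars_nested$data[[am1_row]]

# ... and only then does $mpg refer to a column.
manual_mean <- mean(am1_tibble$mpg)
manual_mean
```

map() simply repeats the last two steps for each element of the list it is given.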

1
votes

You should pass mtcars_nested$data to map() and take the mean of the mpg column.

take_mean_mpg <- function(df){
     mean(df$mpg)
}

purrr::map(mtcars_nested$data, take_mean_mpg)
#[[1]]
#[1] 24.39231

#[[2]]
#[1] 17.14737
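If you would rather have a plain numeric vector than a list, map_dbl() can be swapped in. The sketch below also labels each value with its am level; the "am_" prefix and the name mean_mpg are just illustrative choices.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

take_mean_mpg <- function(df) {
  mean(df$mpg)
}

# map_dbl() returns a double vector instead of a list.
mean_mpg <- map_dbl(mtcars_nested$data, take_mean_mpg)

# Label each value with its am level so the order of the groups does not matter.
names(mean_mpg) <- paste0("am_", mtcars_nested$am)
mean_mpg
```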