I am teaching myself the tidyverse purrr package in R and am having trouble using map() on a column of nested data frames. Could someone explain what I'm missing?

Using the base R ChickWeight dataset as an example, I can easily get the number of chicks observed at each time point under diet #1 if I first filter for diet #1, like so:

library(tidyverse) 
ChickWeight %>%
  filter(Diet == 1) %>% 
  group_by(Time) %>% 
  summarise(counts = n_distinct(Chick))

This is great, but I would like to do it for every diet at once, and I thought nesting the data and iterating over it with map() would be a good approach. This is what I did:

example <- ChickWeight %>% 
  nest(-Diet) 

Implementing this map function then achieves what I'm aiming for:

map(example$data, ~ .x %>% group_by(Time) %>% summarise(counts = n_distinct(Chick))) 

However, when I try to run this same command in a pipe, using mutate() to put the result in another column of the original data frame, it fails.

example %>% 
   mutate(counts = map(data, ~ .x %>% group_by(Time) %>%  summarise(counts = n_distinct(Chick))))
Error in eval(substitute(expr), envir, enclos) : 
  variable 'Chick' not found

Why does this occur?


I also tried it on the data frame split into a list, and that didn't work either.

ChickWeight %>% 
  split(.$Diet) %>% 
  map(data, ~ .x %>% group_by(Time) %>%  summarise(counts = n_distinct(Chick)))

1 Answer

Because you're using dplyr non-standard evaluation (NSE) inside another dplyr NSE call, the inner summarise() gets confused about which environment to search for Chick. It's probably a bug, really, but you can avoid it with the .data pronoun (new in the development version), which says explicitly to look in the data frame being summarised:

library(tidyverse)

ChickWeight %>% 
    nest(-Diet) %>% 
    mutate(counts = map(data, 
                        ~.x %>% group_by(Time) %>% 
                            summarise(counts = n_distinct(.data$Chick))))
#> # A tibble: 4 × 3
#>     Diet               data            counts
#>   <fctr>             <list>            <list>
#> 1      1 <tibble [220 × 3]> <tibble [12 × 2]>
#> 2      2 <tibble [120 × 3]> <tibble [12 × 2]>
#> 3      3 <tibble [120 × 3]> <tibble [12 × 2]>
#> 4      4 <tibble [118 × 3]> <tibble [12 × 2]>
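
If you then want the per-diet counts back as one flat table, the counts list-column can be unnested. The following is only a sketch: the as_tibble() call and the nest(data = -Diet) spelling (the tidyr >= 1.0 form of nest(-Diet)) are my assumptions about a current tidyverse install, and the names nested and flat are just illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

# as_tibble() strips ChickWeight's extra groupedData classes; this is a
# precaution assumed here, not something the code above requires.
nested <- as_tibble(ChickWeight) %>%
  nest(data = -Diet) %>%  # assumed tidyr >= 1.0 spelling of nest(-Diet)
  mutate(counts = map(data, ~ .x %>%
                        group_by(Time) %>%
                        summarise(counts = n_distinct(.data$Chick))))

# Drop the raw data column, then expand each small tibble into rows.
flat <- nested %>%
  select(Diet, counts) %>%
  unnest(counts)
```

Dropping the data column before unnest() keeps the other list-column out of the way, so the result has just Diet, Time and counts.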

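To pipe it through a list, drop the data argument from map(). The pipe already passes the split list in as map()'s first argument, so in your version data ended up in the position where map() expects the function: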

ChickWeight %>% 
    split(.$Diet) %>% 
    map(~ .x %>% group_by(Time) %>%  summarise(counts = n_distinct(Chick))) %>% .[[1]]

#> # A tibble: 12 × 2
#>     Time counts
#>    <dbl>  <int>
#> 1      0     20
#> 2      2     20
#> 3      4     19
#> 4      6     19
#> 5      8     19
#> 6     10     19
#> 7     12     19
#> 8     14     18
#> 9     16     17
#> 10    18     17
#> 11    20     17
#> 12    21     16
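
Rather than pulling out one element with .[[1]], the whole named list can be stacked into a single data frame. This is a minimal sketch, assuming dplyr's bind_rows() with its .id argument; per_diet and combined are hypothetical names, and as_tibble() is again only a precaution.

```r
library(dplyr)
library(purrr)

# split() names the list elements "1".."4" after the levels of Diet.
per_diet <- as_tibble(ChickWeight) %>%
  split(.$Diet) %>%
  map(~ .x %>% group_by(Time) %>% summarise(counts = n_distinct(Chick)))

# .id turns those list names into a column, so the diet label is kept.
combined <- bind_rows(per_diet, .id = "Diet")
```

Note that Diet comes back as a character column here, because it is rebuilt from the list names rather than from the original factor.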

A simpler option would be to just group by both columns:

ChickWeight %>% group_by(Diet, Time) %>% summarise(counts = n_distinct(Chick))

#> Source: local data frame [48 x 3]
#> Groups: Diet [?]
#> 
#>      Diet  Time counts
#>    <fctr> <dbl>  <int>
#> 1       1     0     20
#> 2       1     2     20
#> 3       1     4     19
#> 4       1     6     19
#> 5       1     8     19
#> 6       1    10     19
#> 7       1    12     19
#> 8       1    14     18
#> 9       1    16     17
#> 10      1    18     17
#> # ... with 38 more rows
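
As the Groups: Diet line in the output shows, the result is still grouped by Diet after summarise(). On dplyr 1.0.0 or later (an assumption about your install), the .groups argument can return an ungrouped result instead; a small sketch, with as_tibble() added only as a precaution:

```r
library(dplyr)

counts_by_diet <- as_tibble(ChickWeight) %>%
  group_by(Diet, Time) %>%
  # .groups = "drop" removes all grouping from the result (dplyr >= 1.0.0).
  summarise(counts = n_distinct(Chick), .groups = "drop")
```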