Using purrr to iterate over two lists and then pipe into dplyr::filter

Question

library(tidyverse)
library(purrr)

Using the sample data below, I can create the following function:

Funs <- function(DF, One, Two){

    One <- enquo(One)
    Two <- enquo(Two)

    DF %>% filter(School == (!!One) & Code == (!!Two)) %>%
        group_by(Code, School) %>%
        summarise(Count = sum(Question1))
}

I can then use the function to filter on two variables - School and Code - like this:

Funs(DF, "School1", "B344")

That's all good, but my actual data has many variables and therefore instead of having to constantly type the "School" and "Code" variables into the function, I would like to use tidyverse and the purrr package to loop over two lists (one of School, and one for Code) and feed this into filter. I would like the output to be a list of results.

For the sake of simplicity, the two lists to feed into dplyr::filter will only have two values each: School2 will go with S300, and School1 with B344, just like the example above.

Some examples that I've tried:

map2(c(“School2”, ”School1”),
     c(“S300”, ”B344”),
     function(x,y) {
         DF %>% filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1))
     }

Also...

map2(c("School2", "School1")),
     c("S300","B344"),
     ~filter(School == .x & Code == .y) %>%
         group_by(Code, School)%>%
         summarise(Count = sum(Question1))

And this...

list(c("School2", "School1"), c("S300", "B344")) %>%
    map2( ~ filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1)))

None of these seem to work, so help would be appreciated!

Sample data:

Code <- c("B344","B555","S300","T220","B888","B888","B555","B344","B344","T220","B555","B555","S300","B555","S300","S300","S300","S300","B344","B344","B888","B888","B888")
School <- c("School1","School1","School2","School3","School4","School4","School1","School1","School3","School3","School4","School1","School1","School3","School2","School2","School4","School2","School3","School4","School3","School1","School2")
Question1 <- c(3,4,5,4,5,5,5,4,5,3,4,5,4,5,4,3,3,3,4,5,4,3,3)
Question2 <- c(5,4,3,4,3,5,4,3,2,3,4,5,4,5,4,3,4,4,5,4,3,3,4)
DF <- data_frame(Code, School, Question1, Question2)

You could do something like map2(c("School2", "School1"), c("S300", "B344"), ~DF %>% filter(School == .x, Code == .y) %>% group_by(Code, School) %>% summarise(Count = sum(Question1))), but this seems really pointless; it's easier to do something like DF %>% filter(paste(School, Code) %in% paste(c("School2", "School1"), c("S300", "B344"))) %>% group_by(Code, School) %>% summarise(Count = sum(Question1)) — alistaire
It looks like your first suggestion is what I was looking for. I realize now that using map2_df is probably better. Also, it's probably better for me to create two lists, like list1<-c("School2","School1") and list2<-c("S300","B344") and then using these since my actual data has at least a dozen values for each list. So I'm thinking this... map2_df(list1,list2, ~df%>%filter(School==.x,Code==.y)%>%group_by(Code,School)%>%summarise(Count=sum(Question1))) — Mike
I can accept your first response if you post it as an official answer. Also, I'm curious to know why you think the second suggestion is better? The output was probably better, but I can use map2_df for that... — Mike

alistaire alistaire · Accepted Answer · 2018-01-08T02:13:52

Here's some options, from most like your code to most optimal:

library(tidyverse)

DF <- data_frame(Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344", "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300", "S300", "S300", "B344", "B344", "B888", "B888", "B888"), 
                 School = c("School1", "School1", "School2", "School3", "School4", "School4", "School1", "School1", "School3", "School3", "School4", "School1", "School1", "School3", "School2", "School2", "School4", "School2", "School3", "School4", "School3", "School1", "School2"), 
                 Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3), 
                 Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4))

wanted <- data_frame(School = c("School2", "School1"),
                     Code = c("S300", "B344"))

To get map2 to work, if using tilde notation, the variables are named .x and .y; if you use regular function notation, you can call them whatever you like. Don't forget that the first parameter of filter is the data frame piped in, so:

map2_dfr(wanted$School, wanted$Code, ~filter(DF, School == .x, Code == .y)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

Since I set up wanted as a data frame (a vanilla list would work too), you can use pmap instead. Parameter names with pmap can actually be the same as map2 for two variables, but it's really a function with ... for its parameters, so it often makes sense to handle them differently, e.g. with ..1 notation:

wanted %>% 
    pmap_dfr(~filter(DF, School == ..1, Code == ..2)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

The problem with both of the above techniques is that at scale, they will be slow, because they are running filter for every row of wanted, meaning you're retesting each row many times. To keep the code similar, a slightly hacky way to avoid the extra work is to combine the columns into one, e.g. with tidyr::unite:

DF %>% 
    unite(school_code, School, Code) %>% 
    filter(school_code %in% invoke(paste, wanted, sep = '_')) %>%    # or paste(wanted$School, wanted$Code, sep = '_') or equivalent
    separate(school_code, c('School', 'Code')) %>%
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

...or just combine them within filter itself:

DF %>% 
    filter(paste(School, Code) %in% paste(wanted$School, wanted$Code)) %>%    # or invoke(paste, wanted)
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

The best way to get your desired result is perhaps more obvious now that I've set up wanted as a data frame: a join, which is designed to do precisely this job:

DF %>% 
    inner_join(wanted) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> Joining, by = c("Code", "School")
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

Using purrr to iterate over two lists and then pipe into dplyr::filter

1 Answers