How to parallelize a grouped mutate/summarise in R

Question

In tidy R, how do I parallelize a grouped summarize (or mutate) function call? A transform to the iris dataset illustrates my problem.

I created a simple function - it takes two numerical vectors as args. It returns a list with a 2-column tibble.

 library(tidyverse)
 geoMaxMean <- function(pLen, pWid){
    list(
      tibble(maxLen = max(pLen), 
             geoMean = sqrt(max(pLen) * max(pWid))))}

Applying this to iris

 gIris <- iris %>% 
    as_tibble() %>% 
    group_by(Species) %>% 
    summarise(Cols2 = geoMaxMean(Petal.Length, Petal.Width)) %>% 
    unnest(Cols2)

Gives the intended result.

Species     maxLen      geoMean
setosa      1.9         1.067708
versicolor  5.1         3.029851
virginica   6.9         4.153312

How do I parallelize the geoMaxMean call? I've tried to rework the call with lapply or foreach, but I haven't been able to figure it out.

I'm running R 3.4.4 on RStudio Pro.

The only thing meant to work directly within the tidyverse for parallel processing that I know of is multidplyr but I don't think it's under terribly active development. — joran

Graeme Frost · Accepted Answer · 2019-05-21T15:47:53

Here's a chunk of code that gets this done using the pbmcapply package. mclapply from the parallel package would work the same way, but pbmclapply also gives you a progress bar, which is handy.

library(tidyverse)
library(magrittr)
library(pbmcapply)

allSpecies <- 
  iris %>%
  pull(Species) %>%
  unique 

geoMaxMean <- 
  function(species, data){
    data <- data[data$Species == species,]
    pLen <- data$Petal.Length
    pWid <-  data$Petal.Width
    rm(data)

    out <- 
      tibble(maxLen = max(pLen), 
             geoMean = sqrt(max(pLen) * max(pWid))
             )
    return(out)
}

# leave two cores free, but never ask for fewer than one worker
nCores <- 
  detectCores() %>%
  subtract(2) %>%
  max(1)

gIris <-
  allSpecies %>%
  as.list %>%
  pbmclapply(geoMaxMean,
             data = iris,
             mc.cores = nCores
             ) %>%
  bind_rows %>%
  tibble("Species" = allSpecies, .)

The key difference here is that you have to rethink what goes into the function you feed to the parallelized apply call. Your original function only performed the calculation and left the grouping to dplyr. If you instead design the function to subset the data to one group and then perform the calculation, it becomes easy to parallelize: pass the list of all grouping labels as the input list to pbmclapply, and supply your data as an extra argument to the function rather than as the input.
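For comparison, here is a minimal sketch of the same split-then-apply idea using only base R and the parallel package, in case you would rather not install pbmcapply. It splits the data into one data frame per species up front, so the worker function receives a group's rows directly instead of a label plus the full data set. The names geoMaxMeanGroup, available, groups, and results are made up for this illustration. mclapply relies on forking, so the sketch assumes a Unix-like system for any real speed-up and falls back to a single core on Windows.

```r
library(parallel)  # ships with R; provides mclapply() and detectCores()

# Hypothetical per-group worker: takes one group's rows, returns a one-row data frame.
geoMaxMeanGroup <- function(grp) {
  data.frame(
    maxLen  = max(grp$Petal.Length),
    geoMean = sqrt(max(grp$Petal.Length) * max(grp$Petal.Width))
  )
}

# Forking is unavailable on Windows, and detectCores() can return NA,
# so fall back to a single core in both cases.
available <- detectCores()
nCores <- if (.Platform$OS.type == "windows" || is.na(available)) {
  1L
} else {
  max(1L, available - 2L)
}

# Split once into a named list of per-species data frames, then map in parallel.
groups  <- split(iris, iris$Species)
results <- mclapply(groups, geoMaxMeanGroup, mc.cores = nCores)

# Reassemble: one row per group, with the group label restored from the list names.
gIris <- do.call(rbind, results)
gIris <- cbind(Species = names(results), gIris)
rownames(gIris) <- NULL
```

The resulting gIris should hold the same three rows as the table in the question. For a data set this small the parallel overhead will outweigh any gain; the pattern only pays off when the per-group calculation is expensive.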

Hope this helps.
