1
votes

In the tidyverse, how do I parallelize a grouped summarize (or mutate) function call? A transformation of the iris dataset illustrates my problem.

I created a simple function that takes two numeric vectors as arguments and returns a list containing a one-row, two-column tibble.

 library(tidyverse)
 geoMaxMean <- function(pLen, pWid){
    list(
      tibble(maxLen = max(pLen), 
             geoMean = sqrt(max(pLen) * max(pWid))))}

Applying this to iris

 gIris <- iris %>% 
    as_tibble() %>% 
    group_by(Species) %>% 
    summarise(Cols2 = geoMaxMean(Petal.Length, Petal.Width)) %>% 
    unnest(Cols2)

Gives the intended result.

Species     maxLen      geoMean
setosa      1.9         1.067708
versicolor  5.1         3.029851
virginica   6.9         4.153312

How do I parallelize the geoMaxMean call? I've tried to rework the call with lapply or foreach, but I haven't been able to figure it out.

I'm running R 3.4.4 on RStudio Pro.

2
The only thing I know of that is meant to work directly within the tidyverse for parallel processing is multidplyr, but I don't think it's under terribly active development. - joran

2 Answers

1
votes

Here's a chunk of code that gets that done using the pbmcapply package. mclapply from the parallel package would also work and is called the same way, but pbmclapply gives you a progress bar, which is handy.

library(tidyverse)
library(magrittr)
library(pbmcapply)

allSpecies <- 
  iris %>%
  pull(Species) %>%
  unique 

geoMaxMean <- 
  function(species, data){
    data <- data[data$Species == species,]
    pLen <- data$Petal.Length
    pWid <-  data$Petal.Width
    rm(data)

    out <- 
      tibble(maxLen = max(pLen), 
             geoMean = sqrt(max(pLen) * max(pWid))
             )
    return(out)
}

# Leave two cores free, but never request fewer than one worker
nCores <- 
  max(1, parallel::detectCores() - 2, na.rm = TRUE)

gIris <-
  allSpecies %>%
  as.list %>%
  pbmclapply(geoMaxMean,
             data = iris,
             mc.cores = nCores
             ) %>%
  bind_rows %>%
  tibble("Species" = allSpecies, .)

The key difference is that you have to rethink what goes into the function you feed to the parallelized apply function. Your original code put only the calculation in the function and relied on group_by to do the grouping outside it. If you instead design the function to subset the data to one group and then perform the calculation, it is easy to parallelize: pass the list of group labels as the input to pbmclapply, and supply the full data set as an extra argument to the function rather than as the input being iterated over.
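The split-then-apply pattern described above can also be sketched with base R alone. This is only an illustration, not part of the code above: summariseGroup, groups and gIrisBase are hypothetical names, and the choice of 2 cores is an arbitrary assumption. It pre-splits the data with split() so each task receives only its own group.

```r
library(parallel)

# Hypothetical helper: summarise one already-split group.
summariseGroup <- function(df) {
  data.frame(
    maxLen  = max(df$Petal.Length),
    geoMean = sqrt(max(df$Petal.Length) * max(df$Petal.Width))
  )
}

# split() returns a named list with one data frame per species.
groups <- split(iris, iris$Species)

# mclapply relies on forking, which is unavailable on Windows,
# so fall back to a single core there (2 cores is an arbitrary choice).
nCores <- if (.Platform$OS.type == "windows") 1L else 2L

results <- mclapply(groups, summariseGroup, mc.cores = nCores)

# Re-attach the group labels, which survive as the list names.
gIrisBase <- cbind(Species = names(results), do.call(rbind, results))
rownames(gIrisBase) <- NULL
gIrisBase
```

Splitting up front means the full data set is not subset again inside every task, which may matter when the data is large.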

Hope this helps.

0
votes

You can also do this with dplyr::group_nest, future, and furrr::future_map_dfr.

(In case it matters, I'm using dplyr 1.0.7, furrr 0.2.3, tidyr 1.1.2, and future 1.21.0)

First, you use group_nest to keep each group's rows together before splitting the data for parallelization (e.g. by .worker_id as below). Then you run the summary on each of the separate worker chunks, and future_map_dfr automatically recombines the results into a tibble or data frame (i.e. the equivalent of running bind_rows at the end):

library(tidyverse)

geoMaxMean <- function(pLen, pWid) {
  list(
    tibble(maxLen = max(pLen), 
           geoMean = sqrt(max(pLen) * max(pWid))))
  }

n_workers <- 4
# Setup parallelization
future::plan(future::multisession, workers=n_workers)

gIris <- iris %>% 
  as_tibble() %>% 
  group_by(Species) %>% 
  summarise(Cols2 = geoMaxMean(Petal.Length, Petal.Width)) %>% 
  unnest(Cols2)

gIris_parallel <- iris %>% 
  group_nest(Species, .key="grouped_data") %>% 
  # Randomly assign each group to a worker; some workers may receive no groups
  dplyr::mutate(.worker_id = sample(1:n_workers, replace=TRUE, size=nrow(.))) %>% 
  dplyr::group_split(.worker_id, .keep=FALSE) %>% 
  furrr::future_map_dfr(
    function(.data) tidyr::unnest(.data, grouped_data) %>% 
      group_by(Species) %>% 
      summarise(Cols2 = geoMaxMean(Petal.Length, Petal.Width)) %>% 
      unnest(Cols2)
  )
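If one task per group is acceptable, a simpler variant is to skip the worker-id step and map over the nested column directly. The following is a minimal sketch under that assumption, not the method above: gIris_nested and the stats column are hypothetical names, and 2 workers is an arbitrary choice.

```r
library(dplyr)
library(tidyr)
library(furrr)

# Assumed setup: two background R sessions.
future::plan(future::multisession, workers = 2)

gIris_nested <- iris %>%
  group_nest(Species, .key = "grouped_data") %>%
  # One future per group; each call sees only that group's rows as .x
  mutate(stats = future_map(
    grouped_data,
    ~ tibble::tibble(
      maxLen  = max(.x$Petal.Length),
      geoMean = sqrt(max(.x$Petal.Length) * max(.x$Petal.Width))
    )
  )) %>%
  select(-grouped_data) %>%
  unnest(stats)

# Return to sequential processing and release the workers.
future::plan(future::sequential)

gIris_nested
```

With many small groups the per-task overhead can dominate, which is presumably why the version above batches several groups per worker.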

As an aside, note that when the function passed to summarise returns a tibble and the result is left unnamed, summarise automatically unpacks its columns (dplyr 1.0.0 and later), which eliminates the need for the dummy variable Cols2:

geoMaxMean_to_tibble <- function(pLen, pWid) {
    tibble(maxLen = max(pLen), 
           geoMean = sqrt(max(pLen) * max(pWid)))
  }

gIris <- iris %>% 
  as_tibble() %>% 
  group_by(Species) %>% 
  summarise(geoMaxMean_to_tibble(Petal.Length, Petal.Width))
  # No need to call unnest