
This question is about parallel processing in R with foreach and %dopar%. I have created a simple dataset and a simple operation (the actual operation is more complex, so I am presenting a simplified version here). The code for the data and my current approach is posted below for reference.

Load packages and create data

#Creating a mock dataframe
Area =c('XX','YY','ZZ','XX','YY','ZZ','XX','YY','ZZ','YY')
Car_type = c('A','A','B','C','B','C','A','A','B','C')
Variable1=c(.34,.19,.85,.27,.32,.43,.22,.56,.17,.11)
Variable2=c(.76,.3,.16,.24,.47,.23,.87,.27,.43,.59)
Final_data = data.frame(Area,Car_type,Variable1,Variable2)    
#replicate the above 100 times to create a bigger dataset
n =100
Final_data2=do.call("rbind", replicate(n, Final_data, simplify = FALSE))
Final_data2$Final_value = 0
#car_list = unique(Final_data2$Car_type) #have not figured out how to use this

dopar foreach code

#Create the cluster and register it as the parallel backend
library(doParallel)    
cl=makeCluster(3,type="PSOCK") 
registerDoParallel(cl)


home1 <- function(zz1){
  output <- foreach(x = iter(zz1, by = "row"), .combine = rbind, 
                    .packages = "truncnorm") %dopar% {
    if (x$Car_type=='A'){
      x$Final_value = rtruncnorm(1,a=-1,b=1,mean = x$Variable1,sd=x$Variable2)
    } else if(x$Car_type=='B'){
      x$Final_value = rtruncnorm(1,a=-5,b=5,mean = x$Variable1,sd=1)  
    }  else{
      x$Final_value = rtruncnorm(1,a=-10,b=10,mean = 1,sd=1)
    }
    return(x)
  }
  output
}
Final_data3 <- home1(zz1=Final_data2)
stopCluster(cl) #Stop cluster

In the first part I create a sample dataframe called Final_data2. In the second part, based on the type of car in column "Car_type", I generate a value from a truncated normal distribution whose truncation points, mean and standard deviation change depending on the Car_type. This code works in its current form. It iterates through the rows one after another, distributing them across the different cores.

Issue

Now I want to extend this so that, instead of running the operation on each row as a separate task, the operations run on blocks of the dataset. Specifically, I would like to run the dopar foreach part for the different Areas on separate cores. For example, I want to run the dopar foreach loop for Area = XX on worker 1, Area = YY on worker 2 and Area = ZZ on worker 3. Unfortunately, I could not figure this out by myself. Would someone be able to help me with this? Any help will be appreciated.

Edit: As Privé pointed out, the initial question was a little confusing. I have modified it slightly. Please let me know if it is clearer now.

1
I'm not sure I understand what you want. Could you please provide some code you tried to solve your issue? - F. Privé
Instead of iterating over each row of the dataframe, I want to create a subset of the dataframe by Car_type first and then run the function. My only feeble attempt was to change by = "row" in the iter function to by = car_list (last line of the code in the data creation section). The car_list is just a vector of all the unique cars. Unfortunately, when I did that, I got the following message: Error in match.arg(by) : 'arg' must be NULL or a character vector - Prometheus
I'm not sure I understand your problem, but you might want to look at the functions group_by or case_when of package {dplyr}. - F. Privé
@F.Privé Hi Privé. I realized what was causing the confusion. Would you please take a look at the modified question? I have a column of regions (XX, YY, ZZ) which I want to run on the different workers instead of running the code per row. Does this make sense? I realized that in the previous version, since I had one column of Car_type and the function was essentially looping through the car types, my question was a little problematic. - Prometheus
Were you able to do this using the dopar approach? I have a similar problem. - 89_Simple

1 Answer

1
votes

For your particular application, I would have used purrr::pmap_dbl():

home2 <- function(Car_type, Variable1, Variable2) {
  if (Car_type=='A'){
    truncnorm::rtruncnorm(1,a=-1,b=1,mean = Variable1,sd=Variable2)
  } else if(Car_type=='B'){
    truncnorm::rtruncnorm(1,a=-5,b=5,mean = Variable1,sd=1)  
  }  else{
    truncnorm::rtruncnorm(1,a=-10,b=10,mean = 1,sd=1)
  }
}

Final_data2$Final_value <- 
  purrr::pmap_dbl(Final_data2[c("Car_type", "Variable1", "Variable2")], home2)

If this operation is really taking a long time, you can easily parallelize it using packages {future} and {furrr}:

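# multisession replaces the multiprocess strategy, which is deprecated in recent versions of {future}
future::plan(future::multisession)
# seed = TRUE gives each task a parallel-safe random number stream
Final_data2$Final_value <- 
  furrr::future_pmap_dbl(Final_data2[c("Car_type", "Variable1", "Variable2")], home2,
                         .options = furrr::furrr_options(seed = TRUE))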
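
If you would rather stay with foreach and have each Area handled as one block, a minimal sketch (not tested on your real data) is to split() the dataframe by Area and let foreach iterate over the resulting list, so each block becomes one task. The mock data below is made up to stand in for Final_data2, and process_block is a hypothetical helper name that just applies your per-row logic inside one block. Note that foreach decides which worker receives which block; as far as I know you cannot pin Area = XX to a specific worker, but with three blocks and three workers each block should run on its own process.

```r
library(doParallel)  # also attaches foreach, iterators and parallel

# Small stand-in for the question's Final_data2 (values are made up)
set.seed(1)
Final_data2 <- data.frame(
  Area      = rep(c("XX", "YY", "ZZ"), times = 10),
  Car_type  = sample(c("A", "B", "C"), 30, replace = TRUE),
  Variable1 = runif(30),
  Variable2 = runif(30, min = 0.1, max = 1)
)

# Hypothetical helper: process one Area block, row by row, on a single worker
process_block <- function(block) {
  block$Final_value <- vapply(seq_len(nrow(block)), function(i) {
    if (block$Car_type[i] == "A") {
      truncnorm::rtruncnorm(1, a = -1, b = 1,
                            mean = block$Variable1[i], sd = block$Variable2[i])
    } else if (block$Car_type[i] == "B") {
      truncnorm::rtruncnorm(1, a = -5, b = 5, mean = block$Variable1[i], sd = 1)
    } else {
      truncnorm::rtruncnorm(1, a = -10, b = 10, mean = 1, sd = 1)
    }
  }, numeric(1))
  block
}

cl <- makeCluster(3, type = "PSOCK")
registerDoParallel(cl)

# split() returns a named list with one dataframe per Area;
# foreach then runs one task per list element
blocks <- split(Final_data2, Final_data2$Area)
Final_data3 <- foreach(block = blocks, .combine = rbind) %dopar% {
  process_block(block)
}

stopCluster(cl)
```

The rows of Final_data3 come back grouped by Area rather than in their original order, so sort afterwards if the order matters. With only a handful of large blocks, the per-task overhead is much lower than with one task per row, which is usually where the speed-up comes from.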