This question is about parallel processing in R with foreach and %dopar%. I have created a simple dataset and a simple operation (the actual operation is more complex, so I am presenting a simplified one here). The data and the current code are posted below for reference.
Load packages and create data
#Creating a mock dataframe
Area =c('XX','YY','ZZ','XX','YY','ZZ','XX','YY','ZZ','YY')
Car_type = c('A','A','B','C','B','C','A','A','B','C')
Variable1=c(.34,.19,.85,.27,.32,.43,.22,.56,.17,.11)
Variable2=c(.76,.3,.16,.24,.47,.23,.87,.27,.43,.59)
Final_data = data.frame(Area,Car_type,Variable1,Variable2)
#replicate the above 100 times to create a bigger dataset
n =100
Final_data2=do.call("rbind", replicate(n, Final_data, simplify = FALSE))
Final_data2$Final_value = 0
#car_list = unique(Final_data2$Car_type) #have not figured out how to use this
dopar foreach code
#Create the cluster and load the required packages on the workers
library(doParallel)
cl=makeCluster(3,type="PSOCK")
registerDoParallel(cl)
home1 <- function(zz1){
  output <- foreach(x = iter(zz1, by = "row"), .combine = rbind,
                    .packages = "truncnorm") %dopar% {
    if (x$Car_type=='A'){
      x$Final_value = rtruncnorm(1,a=-1,b=1,mean = x$Variable1,sd=x$Variable2)
    } else if(x$Car_type=='B'){
      x$Final_value = rtruncnorm(1,a=-5,b=5,mean = x$Variable1,sd=1)
    } else{
      x$Final_value = rtruncnorm(1,a=-10,b=10,mean = 1,sd=1)
    }
    return(x)
  }
  output
}
Final_data3 <- home1(zz1=Final_data2)
stopCluster(cl) #Stop cluster
In the first part I create a sample dataframe called Final_data2. In the second part, based on the type of car in column "Car_type", I generate a value from a truncated normal distribution where the truncation points, the mean and the standard deviation change depending on the Car_type. This code works in its current form. It iterates through the rows one after another, distributing them across the different cores.
Issue
Now I want to extend this so that, instead of sending each row to a core separately, the operation runs on blocks of the dataset. Specifically, I would like to run the foreach/%dopar% part for the different Areas on separate cores. For example, I want to process Area = XX on worker 1, Area = YY on worker 2 and Area = ZZ on worker 3. Unfortunately, I could not figure this out by myself. Could someone help me with this? Any help will be appreciated.
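To make the request concrete, here is a minimal sketch of the block-wise version I have in mind. It assumes that split() by Area is an acceptable way to form the blocks and that each block can be processed sequentially inside one worker. The helper name process_block and the small mock data frame are made up for illustration; they are not part of my real code.

```r
library(doParallel)   # also attaches foreach, iterators and parallel
library(truncnorm)

# Small mock data with the same columns as Final_data2 (illustrative only)
Final_data2 <- data.frame(
  Area        = rep(c("XX", "YY", "ZZ"), times = 4),
  Car_type    = rep(c("A", "B", "C", "A"), times = 3),
  Variable1   = seq(0.1, 0.6, length.out = 12),
  Variable2   = seq(0.2, 0.9, length.out = 12),
  Final_value = 0
)

# Hypothetical helper: processes one whole Area block row by row
process_block <- function(block) {
  for (i in seq_len(nrow(block))) {
    type <- as.character(block$Car_type[i])
    block$Final_value[i] <- if (type == "A") {
      truncnorm::rtruncnorm(1, a = -1, b = 1,
                            mean = block$Variable1[i], sd = block$Variable2[i])
    } else if (type == "B") {
      truncnorm::rtruncnorm(1, a = -5, b = 5, mean = block$Variable1[i], sd = 1)
    } else {
      truncnorm::rtruncnorm(1, a = -10, b = 10, mean = 1, sd = 1)
    }
  }
  block
}

cl <- makeCluster(3, type = "PSOCK")
registerDoParallel(cl)

# One data frame per Area; foreach iterates over the list elements,
# so each %dopar% task receives a complete block instead of a single row
blocks <- split(Final_data2, Final_data2$Area)

Final_data3 <- foreach(block = blocks, .combine = rbind,
                       .packages = "truncnorm") %dopar% {
  process_block(block)
}

stopCluster(cl)
```

With three Areas and three workers, each worker would get one block. Note that the rows of Final_data3 come back grouped by Area rather than in the original row order, and that foreach does not guarantee which worker handles which block.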
Edit: As F. Privé pointed out, the initial question was a little confusing. I have modified it slightly. Please let me know if it is clearer now.
group_by or case_when of package {dplyr}. – F. Privé