I'm doing optimization in R. My problem involves running nlm on an objective function that loops over a large list of data. I'd like to speed up the optimization by running the objective function in parallel. How should I go about doing that?

In the example below I set up a toy problem in which the parallelized solution is slower than the original. How do I modify the code to reduce overhead and speed up the parallelized version of my nlm call?
library(parallel)
## What is the right way to do optimization when the objective function is run in parallel?
## Don't want very_big_list to be copied more than necessary
set.seed(952)
my_objfn <- function(list_element, parameter) {
    return(sum((list_element - parameter) ^ 2)) # Simple example
}

apply_my_objfn_in_parallel <- function(parameter, very_big_list, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    objfn_values <- parLapply(cluster, very_big_list, my_objfn, parameter=parameter)
    stopCluster(cluster)
    return(Reduce("+", objfn_values))
}

apply_my_objfn <- function(parameter, very_big_list) {
    objfn_values <- lapply(very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}
my_big_list <- replicate(2 * 10^6, sample(seq_len(100), size=5), simplify=FALSE)
parameter_guess <- 20
mean(c(my_big_list, recursive=TRUE)) # Should be close to 50
system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0)) # 84.2 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0)) # 63.6 elapsed
I ran this on my laptop (4 CPUs, so the cluster returned by makeCluster(min(max_cores, detectCores() - 1)) has 3 cores). In the last lines above, apply_my_objfn_in_parallel takes longer than apply_my_objfn. I think this is because (1) I only have 3 cores, and (2) each time nlm calls the parallelized objective function, it sets up a new cluster and breaks up and copies all of my_big_list. That seems wasteful -- would I get better results if I somehow set up the cluster and copied the list only once per nlm call? If so, how do I do that?
Edit after Erwin's answer ("consider creating and stopping the cluster once instead of in each evaluation"):
## Modify function to use single cluster per nlm call
apply_my_objfn_in_parallel_single_cluster <- function(parameter, very_big_list, my_cluster) {
    objfn_values <- parLapply(my_cluster, very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

run_nlm_single_cluster <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    nlm_result <- nlm(apply_my_objfn_in_parallel_single_cluster, parameter_guess,
                      very_big_list=very_big_list, my_cluster=cluster, print.level=0)
    stopCluster(cluster)
    return(nlm_result)
}
system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0)) # 49.0 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0)) # 36.8 elapsed
system.time(test_single_cluster <- run_nlm_single_cluster(my_big_list,
                                                           parameter_guess)) # 38.4 elapsed
In addition to my laptop (elapsed times in comments above), I ran the code on a server with 30 cores. There my elapsed times were 107 for apply_my_objfn and 74 for run_nlm_single_cluster. I'm surprised that the times were longer than on my puny little laptop, but it makes sense that the single-cluster parallel optimization beats the regular non-parallel version when you have more cores.
Another edit, for completeness (see comments under Erwin's answer): here is a non-parallel solution using analytical gradients. Surprisingly, it is slower than with numerical gradients.
## Add gradients
my_objfn_value_and_gradient <- function(list_element, parameter) {
    return(c(sum((list_element - parameter) ^ 2), -2 * sum(list_element - parameter)))
}

apply_my_objfn_with_gradient <- function(parameter, very_big_list) {
    ## Returns objfn value with gradient attribute, see ?nlm
    objfn_values_and_grads <- lapply(very_big_list, my_objfn_value_and_gradient, parameter=parameter)
    objfn_value_and_grad <- Reduce("+", objfn_values_and_grads)
    stopifnot(length(objfn_value_and_grad) == 2) # First is objfn value, second is gradient
    objfn_value <- objfn_value_and_grad[1]
    attr(objfn_value, "gradient") <- objfn_value_and_grad[2]
    return(objfn_value)
}
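As a quick sanity check (not part of the timing comparison), the analytical gradient can be spot-checked against a central finite difference on a single list element; elem and eps below are just throwaway names:

## Spot-check the analytical gradient against a central finite difference
elem <- my_big_list[[1]]
eps <- 1e-6
analytical_grad <- my_objfn_value_and_gradient(elem, parameter_guess)[2]
numerical_grad <- (my_objfn(elem, parameter_guess + eps) -
                   my_objfn(elem, parameter_guess - eps)) / (2 * eps)
all.equal(analytical_grad, numerical_grad, tolerance=1e-6) # expect TRUE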
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0)) # 37.4 elapsed
system.time(test_regular_grad <- nlm(apply_my_objfn_with_gradient, parameter_guess,
                                     very_big_list=my_big_list, print.level=0,
                                     check.analyticals=FALSE)) # 45.0 elapsed
I'd be curious to know what's going on here. That said, my question is still: how can I speed up this sort of optimization problem using parallelization?
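(For reference, combining the single-cluster approach above with the analytical gradients would look roughly like the following. I haven't timed this version, and it still ships very_big_list to the workers on every evaluation.)

## Untested sketch: single cluster per nlm call, analytical gradients computed in parallel
apply_my_objfn_with_gradient_in_parallel <- function(parameter, very_big_list, my_cluster) {
    values_and_grads <- parLapply(my_cluster, very_big_list, my_objfn_value_and_gradient,
                                  parameter=parameter)
    value_and_grad <- Reduce("+", values_and_grads)
    objfn_value <- value_and_grad[1]
    attr(objfn_value, "gradient") <- value_and_grad[2]
    return(objfn_value)
}

run_nlm_single_cluster_with_gradient <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    on.exit(stopCluster(cluster))
    nlm(apply_my_objfn_with_gradient_in_parallel, parameter_guess,
        very_big_list=very_big_list, my_cluster=cluster,
        print.level=0, check.analyticals=FALSE)
}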