
I'm doing optimization in R. My problem involves running nlm on an objective function which loops over a large list of data. I'd like to speed up the optimization by running the objective function in parallel. How should I go about doing that?

In the example below I set up a toy problem in which the parallelized solution is slower than the original. How do I modify the code to reduce overhead and speed up the parallelized version of my nlm call?

library(parallel)

## What is the right way to do optimization when the objective function is run in parallel?
## Don't want very_big_list to be copied more than necessary

set.seed(952)

my_objfn <- function(list_element, parameter) {
    return(sum((list_element - parameter) ^ 2))  # Simple example
}

apply_my_objfn_in_parallel <- function(parameter, very_big_list, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    objfn_values <- parLapply(cluster, very_big_list, my_objfn, parameter=parameter)
    stopCluster(cluster)
    return(Reduce("+", objfn_values))
}

apply_my_objfn <- function(parameter, very_big_list) {
    objfn_values <- lapply(very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

my_big_list <- replicate(2 * 10^6, sample(seq_len(100), size=5), simplify=FALSE)
parameter_guess <- 20
mean(c(my_big_list, recursive=TRUE))  # Should be close to 50
system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 84.2 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 63.6 elapsed

I ran this on my laptop (4 CPUs, so the cluster returned by makeCluster(min(max_cores, detectCores() - 1)) has 3 cores). In the last lines above, apply_my_objfn_in_parallel takes longer than apply_my_objfn. I think this is because (1) I only have 3 cores and (2) each time nlm calls the parallelized objective function, it sets up a new cluster and breaks up and copies all of my_big_list. That seems wasteful -- would I get better results if I somehow set up the cluster and copied the list only once per nlm call? If so, how do I do that?


Edit after Erwin's answer ("consider creating and stopping the cluster once instead of in each evaluation"):

## Modify function to use single cluster per nlm call
apply_my_objfn_in_parallel_single_cluster <- function(parameter, very_big_list, my_cluster) {
    objfn_values <- parLapply(my_cluster, very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

run_nlm_single_cluster <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    nlm_result <- nlm(apply_my_objfn_in_parallel_single_cluster, parameter_guess,
                      very_big_list=very_big_list, my_cluster=cluster, print.level=0)
    stopCluster(cluster)
    return(nlm_result)
}

system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 49.0 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 36.8 elapsed
system.time(test_single_cluster <- run_nlm_single_cluster(my_big_list,
                                                          parameter_guess))  # 38.4 elapsed

In addition to my laptop (elapsed times in comments above), I ran the code on a server with 30 cores. There my elapsed times were 107 for apply_my_objfn and 74 for run_nlm_single_cluster. I'm surprised that the times were longer than on my puny little laptop, but it makes sense that the single cluster parallel optimization beats the regular non-parallel version when you have more cores.
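
Note that apply_my_objfn_in_parallel_single_cluster still splits and ships all of very_big_list to the workers at every parLapply call. Here is a sketch of what I think the "copy the list only once" version could look like: split the list into one chunk per worker up front, stash each chunk in the worker's global environment, and then only send the scalar parameter at each evaluation. The helper names (set_local_chunk, local_chunk, worker_chunk_sum, make_preloaded_cluster, apply_my_objfn_preloaded, run_nlm_preloaded) are made up for this sketch, and I haven't timed it.

## Sketch: distribute chunks of very_big_list to the workers once,
## so each objective evaluation only sends the (small) parameter

## Runs on a worker: stash this worker's chunk in its global environment
set_local_chunk <- function(chunk) {
    assign("local_chunk", chunk, envir=.GlobalEnv)
    return(NULL)
}

## Runs on a worker: sum my_objfn over the worker's local chunk
worker_chunk_sum <- function(parameter) {
    return(sum(vapply(local_chunk, my_objfn, numeric(1), parameter=parameter)))
}

make_preloaded_cluster <- function(very_big_list, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    chunks <- clusterSplit(cluster, very_big_list)  # one chunk per worker
    clusterApply(cluster, chunks, set_local_chunk)  # each chunk is copied exactly once
    clusterExport(cluster, "my_objfn")
    return(cluster)
}

apply_my_objfn_preloaded <- function(parameter, cluster) {
    ## Only `parameter` crosses the connections; the data already live on the workers
    return(Reduce("+", clusterCall(cluster, worker_chunk_sum, parameter)))
}

run_nlm_preloaded <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- make_preloaded_cluster(very_big_list, max_cores)
    on.exit(stopCluster(cluster))
    return(nlm(apply_my_objfn_preloaded, parameter_guess,
               cluster=cluster, print.level=0))
}

## system.time(test_preloaded <- run_nlm_preloaded(my_big_list, parameter_guess))

I define set_local_chunk and worker_chunk_sum at the top level deliberately, so that their enclosing environment is the global environment and no large local objects get serialized along with them when they are sent to the workers.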


Another edit, for completeness (see comments under Erwin's answer): here is a non-parallel solution using analytical gradients. Surprisingly, it is slower than with numerical gradients.

## Add gradients
my_objfn_value_and_gradient <- function(list_element, parameter) {
    return(c(sum((list_element - parameter) ^ 2), -2*sum(list_element - parameter)))
}

apply_my_objfn_with_gradient <- function(parameter, very_big_list) {
    ## Returns objfn value with gradient attribute, see ?nlm
    objfn_values_and_grads <- lapply(very_big_list, my_objfn_value_and_gradient, parameter=parameter)
    objfn_value_and_grad <- Reduce("+", objfn_values_and_grads)
    stopifnot(length(objfn_value_and_grad) == 2)  # First is objfn value, second is gradient
    objfn_value <- objfn_value_and_grad[1]
    attr(objfn_value, "gradient") <- objfn_value_and_grad[2]
    return(objfn_value)
}

system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 37.4 elapsed
system.time(test_regular_grad <- nlm(apply_my_objfn_with_gradient, parameter_guess,
                                     very_big_list=my_big_list, print.level=0,
                                     check.analyticals=FALSE))  # 45.0 elapsed

I'd be curious to know what's going on here. That said, my question is still: how can I speed up this sort of optimization problem using parallelization?
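
Along the same lines, here is a sketch (again untested, with made-up names, relying on make_preloaded_cluster and local_chunk from the sketch above plus an extra clusterExport of my_objfn_value_and_gradient) of combining the pre-loaded chunks with the analytical gradients:

## Sketch: parallel value-and-gradient on the pre-loaded chunks

## Runs on a worker: c(value, gradient) summed over the worker's local chunk
worker_chunk_value_and_gradient <- function(parameter) {
    return(Reduce("+", lapply(local_chunk, my_objfn_value_and_gradient, parameter=parameter)))
}

apply_my_objfn_preloaded_with_gradient <- function(parameter, cluster) {
    value_and_grad <- Reduce("+", clusterCall(cluster, worker_chunk_value_and_gradient, parameter))
    objfn_value <- value_and_grad[1]
    attr(objfn_value, "gradient") <- value_and_grad[2]
    return(objfn_value)
}

## Usage:
## cluster <- make_preloaded_cluster(my_big_list)
## clusterExport(cluster, "my_objfn_value_and_gradient")
## system.time(test_preloaded_grad <- nlm(apply_my_objfn_preloaded_with_gradient,
##                                        parameter_guess, cluster=cluster,
##                                        print.level=0, check.analyticals=FALSE))
## stopCluster(cluster)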

I think what I'm looking for is related to the comments in stackoverflow.com/questions/6689937/… -- particularly wanting to "load chunks of data individually in each node" – Adrian
github.com/hadley/multidplyr/blob/master/vignettes/… looks relevant, I'll have a look at that – Adrian
The "Example: Cost of data movement" slide at labs.hpe.com/research/systems-research/R-workshop/… is what I had in mind when I wrote "don't want very_big_list to be copied more than necessary" in my example code above – Adrian
Maybe github.com/vertica/DistributedR is what I'm looking for – Adrian
The following answer could help: stackoverflow.com/questions/3757321/… – Nairolf

1 Answer


Looks to me like there is too much overhead in the parallel function evaluation to make it worthwhile. Consider creating and stopping the cluster once instead of in each evaluation. Also, I believe you don't provide gradients, so the solver will likely resort to finite differences, which can lead to a large number of function evaluations. You may want to consider providing gradients.