
I'm doing optimization in R. My problem involves running nlm on an objective function which loops over a large list of data. I'd like to speed up the optimization by running the objective function in parallel. How should I go about doing that?

In the example below I set up a toy problem in which the parallelized solution is slower than the original. How do I modify the code to reduce overhead and speed up the parallelized version of my nlm call?

library(parallel)

## What is the right way to do optimization when the objective function is run in parallel?
## Don't want very_big_list to be copied more than necessary

set.seed(952)

my_objfn <- function(list_element, parameter) {
    return(sum((list_element - parameter) ^ 2))  # Simple example
}

apply_my_objfn_in_parallel <- function(parameter, very_big_list, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    objfn_values <- parLapply(cluster, very_big_list, my_objfn, parameter=parameter)
    stopCluster(cluster)
    return(Reduce("+", objfn_values))
}

apply_my_objfn <- function(parameter, very_big_list) {
    objfn_values <- lapply(very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

my_big_list <- replicate(2 * 10^6, sample(seq_len(100), size=5), simplify=FALSE)
parameter_guess <- 20
mean(c(my_big_list, recursive=TRUE))  # Should be close to 50
system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 84.2 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 63.6 elapsed

I ran this on my laptop (4 CPUs, so the cluster returned by makeCluster(min(max_cores, detectCores() - 1)) has 3 cores). In the last lines above, apply_my_objfn_in_parallel takes longer than apply_my_objfn. I think this is because (1) I only have 3 cores and (2) each time nlm calls the parallelized objective function, it sets up a new cluster and breaks up and copies all of my_big_list. That seems wasteful -- would I get better results if I somehow set up the cluster and copied the list only once per nlm call? If so, how do I do that?


Edit after Erwin's answer ("consider creating and stopping the cluster once instead of in each evaluation"):

## Modify function to use single cluster per nlm call
apply_my_objfn_in_parallel_single_cluster <- function(parameter, very_big_list, my_cluster) {
    objfn_values <- parLapply(my_cluster, very_big_list, my_objfn, parameter=parameter)
    return(Reduce("+", objfn_values))
}

run_nlm_single_cluster <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    nlm_result <- nlm(apply_my_objfn_in_parallel_single_cluster, parameter_guess,
                      very_big_list=very_big_list, my_cluster=cluster, print.level=0)
    stopCluster(cluster)
    return(nlm_result)
}

system.time(test_parallel <- nlm(apply_my_objfn_in_parallel, parameter_guess,
                                 very_big_list=my_big_list, print.level=0))  # 49.0 elapsed
system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 36.8 elapsed
system.time(test_single_cluster <- run_nlm_single_cluster(my_big_list,
                                                          parameter_guess))  # 38.4 elapsed

In addition to my laptop (elapsed times in comments above), I ran the code on a server with 30 cores. There my elapsed times were 107 for apply_my_objfn and 74 for run_nlm_single_cluster. I'm surprised that the times were longer than on my puny little laptop, but it makes sense that the single cluster parallel optimization beats the regular non-parallel version when you have more cores.
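
Note that apply_my_objfn_in_parallel_single_cluster still splits and ships all of very_big_list to the workers at every parLapply call. Here is a sketch of what I think the "copy the list only once" version could look like: split the list into one chunk per worker up front, stash each chunk in the worker's global environment, and then only send the scalar parameter at each evaluation. The helper names (set_local_chunk, local_chunk, worker_chunk_sum, make_preloaded_cluster, apply_my_objfn_preloaded, run_nlm_preloaded) are made up for this sketch, and I haven't timed it.

## Sketch: distribute chunks of very_big_list to the workers once,
## so each objective evaluation only sends the (small) parameter

## Runs on a worker: stash this worker's chunk in its global environment
set_local_chunk <- function(chunk) {
    assign("local_chunk", chunk, envir=.GlobalEnv)
    return(NULL)
}

## Runs on a worker: sum my_objfn over the worker's local chunk
worker_chunk_sum <- function(parameter) {
    return(sum(vapply(local_chunk, my_objfn, numeric(1), parameter=parameter)))
}

make_preloaded_cluster <- function(very_big_list, max_cores=3) {
    cluster <- makeCluster(min(max_cores, detectCores() - 1))
    chunks <- clusterSplit(cluster, very_big_list)  # one chunk per worker
    clusterApply(cluster, chunks, set_local_chunk)  # each chunk is copied exactly once
    clusterExport(cluster, "my_objfn")
    return(cluster)
}

apply_my_objfn_preloaded <- function(parameter, cluster) {
    ## Only `parameter` crosses the connections; the data already live on the workers
    return(Reduce("+", clusterCall(cluster, worker_chunk_sum, parameter)))
}

run_nlm_preloaded <- function(very_big_list, parameter_guess, max_cores=3) {
    cluster <- make_preloaded_cluster(very_big_list, max_cores)
    on.exit(stopCluster(cluster))
    return(nlm(apply_my_objfn_preloaded, parameter_guess,
               cluster=cluster, print.level=0))
}

## system.time(test_preloaded <- run_nlm_preloaded(my_big_list, parameter_guess))

I define set_local_chunk and worker_chunk_sum at the top level deliberately, so that their enclosing environment is the global environment and no large local objects get serialized along with them when they are sent to the workers.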


Another edit, for completeness (see comments under Erwin's answer): here is a non-parallel solution using analytical gradients. Surprisingly, it is slower than with numerical gradients.

## Add gradients
my_objfn_value_and_gradient <- function(list_element, parameter) {
    return(c(sum((list_element - parameter) ^ 2), -2*sum(list_element - parameter)))
}

apply_my_objfn_with_gradient <- function(parameter, very_big_list) {
    ## Returns objfn value with gradient attribute, see ?nlm
    objfn_values_and_grads <- lapply(very_big_list, my_objfn_value_and_gradient, parameter=parameter)
    objfn_value_and_grad <- Reduce("+", objfn_values_and_grads)
    stopifnot(length(objfn_value_and_grad) == 2)  # First is objfn value, second is gradient
    objfn_value <- objfn_value_and_grad[1]
    attr(objfn_value, "gradient") <- objfn_value_and_grad[2]
    return(objfn_value)
}

system.time(test_regular <- nlm(apply_my_objfn, parameter_guess,
                                very_big_list=my_big_list, print.level=0))  # 37.4 elapsed
system.time(test_regular_grad <- nlm(apply_my_objfn_with_gradient, parameter_guess,
                                     very_big_list=my_big_list, print.level=0,
                                     check.analyticals=FALSE))  # 45.0 elapsed

I'd be curious to know what's going on here. That said, my question is still: how can I speed up this sort of optimization problem using parallelization?
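
Along the same lines, here is a sketch (again untested, with made-up names, relying on make_preloaded_cluster and local_chunk from the sketch above plus an extra clusterExport of my_objfn_value_and_gradient) of combining the pre-loaded chunks with the analytical gradients:

## Sketch: parallel value-and-gradient on the pre-loaded chunks

## Runs on a worker: c(value, gradient) summed over the worker's local chunk
worker_chunk_value_and_gradient <- function(parameter) {
    return(Reduce("+", lapply(local_chunk, my_objfn_value_and_gradient, parameter=parameter)))
}

apply_my_objfn_preloaded_with_gradient <- function(parameter, cluster) {
    value_and_grad <- Reduce("+", clusterCall(cluster, worker_chunk_value_and_gradient, parameter))
    objfn_value <- value_and_grad[1]
    attr(objfn_value, "gradient") <- value_and_grad[2]
    return(objfn_value)
}

## Usage:
## cluster <- make_preloaded_cluster(my_big_list)
## clusterExport(cluster, "my_objfn_value_and_gradient")
## system.time(test_preloaded_grad <- nlm(apply_my_objfn_preloaded_with_gradient,
##                                        parameter_guess, cluster=cluster,
##                                        print.level=0, check.analyticals=FALSE))
## stopCluster(cluster)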

I think what I'm looking for is related to the comments in stackoverflow.com/questions/6689937/… -- particularly wanting to "load chunks of data individually in each node" – Adrian
github.com/hadley/multidplyr/blob/master/vignettes/… looks relevant, I'll have a look at that – Adrian
The "Example: Cost of data movement" slide at labs.hpe.com/research/systems-research/R-workshop/… is what I had in mind when I wrote "don't want very_big_list to be copied more than necessary" in my example code above – Adrian
Maybe github.com/vertica/DistributedR is what I'm looking for – Adrian
The following answer could help: stackoverflow.com/questions/3757321/… – Nairolf

1 Answer


Looks to me like there is too much overhead in the parallel function evaluation to make it worthwhile. Consider creating and stopping the cluster once instead of in each evaluation. Also, I believe you don't provide gradients, so the solver will likely resort to finite differences, which can lead to a large number of function evaluations. You may want to consider providing gradients.