
I was looking for an alternative to furrr::future_map() because, when it is called inside another function, it copies every object defined in that function to each worker, whether or not those objects are explicitly passed (https://github.com/DavisVaughan/furrr/issues/26).

It looks like parLapply() does the same thing when using clusterExport():

fun <- function(x) {
  big_obj <- 1
  cl <- parallel::makeCluster(2)
  parallel::clusterExport(cl, c("x"), envir = environment())
  parallel::parLapply(cl, c(1), function(x) {
    x + 1
    env <- environment()
    parent_env <- parent.env(env)
    return(list(this_env = env, parent_env = parent_env))
  })
}

res <- fun(1)
names(res[[1]]$parent_env)
#> [1] "cl"      "big_obj" "x"

Created on 2020-01-06 by the reprex package (v0.3.0)

How can I keep big_obj from getting copied to each worker? I am using a Windows machine so forking isn't an option.

On Windows, you have to copy the data. The only way to avoid copying is to not hold the data in memory at all: store it on disk and load only the subset you need to work on. – F. Privé
I came across this post: stackoverflow.com/questions/35851761/…. It seems the issue I describe comes from defining the worker function inside another function instead of in the global environment. – Giovanni Colitti

1 Answer


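parLapply() serializes the worker function together with its enclosing environment, and a function defined inside fun() encloses fun()'s whole frame, big_obj included. You can avoid this by changing the environment of your local function to one that does not contain big_obj, e.g. the base environment.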

fun <- function(x) {
  big_obj <- 1
  cl <- parallel::makeCluster(2)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  parallel::clusterExport(cl, c("x"), envir = environment())
  local_fun <- function(x) {
    x + 1
    env <- environment()
    parent_env <- parent.env(env)
    return(list(this_env = env, parent_env = parent_env))
  }
  # Detach local_fun from fun()'s frame so big_obj is not serialized with it
  environment(local_fun) <- baseenv()
  parallel::parLapply(cl, c(1), local_fun)
}
res <- fun(1)
"big_obj" %in% names(res[[1]]$parent_env) # FALSE
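
To check the effect without starting a cluster, you can compare the serialized size of the worker function with and without the environment reset. This is only a rough sketch: make_worker() and strip_env are hypothetical names made up for the illustration, and it assumes that serialize() gives a fair picture of what parLapply() ships to each worker.

```r
# Hypothetical helper: builds a worker function next to a large local object.
make_worker <- function(strip_env) {
  big_obj <- numeric(1e6)          # about 8 MB of doubles in this frame
  worker <- function(x) x + 1
  if (strip_env) {
    # Same trick as in the answer: detach the closure from this frame
    environment(worker) <- baseenv()
  }
  worker
}

# Number of bytes needed to serialize each variant of the worker
size_closure  <- length(serialize(make_worker(FALSE), NULL))
size_stripped <- length(serialize(make_worker(TRUE), NULL))

print(c(closure = size_closure, stripped = size_stripped))
```

The closure that still points at make_worker()'s frame should serialize to several megabytes, while the stripped one should be tiny. Setting the environment to globalenv() should behave similarly, since the global environment is serialized by reference rather than copied.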