Question
I've noticed that foreach/%dopar% sets up a cluster sequentially, not in parallel, before executing tasks in parallel. If each worker requires a dataset and it takes N seconds to transfer the dataset to a worker, then foreach/%dopar% spends (#workers * N) seconds of setup time. This can be significant for a large number of workers or a large N (large datasets to transfer).
My question is whether this is by design, or whether there is some parameter/setting that I'm missing in foreach or perhaps in cluster generation.
Setup
- R 2.15.2
- latest versions of foreach/parallel/doParallel as of today (1/7/2013)
- Windows 7 x64
Example
library( foreach )
library( parallel )
library( doParallel )
# lots of data
data = rnorm( 100000000 )
# make cluster/register - creates 6 nodes fairly quickly
cluster = makePSOCKcluster( 6 , outfile = "" )
registerDoParallel( cluster )
# fire up Task Manager. Observe that each node receives the data sequentially.
# Once the last node has the data, all nodes process at the same time
results = foreach( i = 1 : 500 ) %dopar%
{
print( data[ i ] )
return( data[ i ] )
}
... clusterExport() (via clusterCall()) executes sequentially, I don't think I'll hold my breath until then. – BenBarnes
... fork a process, allowing child processes to access objects loaded in the parent process, only copying those that are modified. Windows machines don't have this particular capability, and with all of the cluster types I've used (which is not all), cluster setup has happened sequentially. – BenBarnes
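One possible workaround, sketched here as an assumption and not something proposed in the thread: instead of pushing the large object to each worker over its socket one after another, write it to disk once and let every worker read it itself. Only a short path string is then sent per worker, and clusterEvalQ() dispatches the read to all workers before collecting results, so the reads can overlap. The names data_path and worker_lengths are illustrative, the dataset is deliberately small, and the sketch assumes all workers can see the same file (true here because they run on the local machine).

```r
library(parallel)

# Small stand-in for the large dataset from the question (assumption).
data <- rnorm(1e5)

# Write the data once to a file that every worker can read.
data_path <- tempfile(fileext = ".rds")
saveRDS(data, data_path)

cluster <- makePSOCKcluster(2)

# Send only the short path string to each worker, not the data itself.
clusterExport(cluster, "data_path", envir = environment())

# Each worker loads the file into its own global environment.
# Returning NULL avoids shipping the data back to the master.
invisible(clusterEvalQ(cluster, {
  data <- readRDS(data_path)
  NULL
}))

# Confirm that every worker now holds its own copy.
worker_lengths <- unlist(clusterEvalQ(cluster, length(data)))
print(worker_lengths)

stopCluster(cluster)
unlink(data_path)
```

Whether this is actually faster depends on disk speed versus socket transfer speed, so it is worth timing both on the real dataset. When combining this with foreach, note that foreach tries to export variables referenced in the loop body automatically; its .noexport argument may be needed to keep the master's copy of data from being sent again.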