
I'm experiencing weird behaviour on my computer when distributing processes among its cores using doMC and foreach. Does anyone know why I get better performance with a single core than with 2 cores? As you can see, running the same code without registering any cores (which supposedly uses only 1 core) is much more time-efficient. While %do% seems to perform better than %dopar%, registering 2 cores out of 4 makes the run far more time-consuming.

require(foreach)
require(doMC)
# 1-core
> system.time(m <- foreach(i=1:100) %dopar% 
+ matrix(rnorm(1000*1000), ncol=5000) )
   user  system elapsed 
  9.285   1.895  11.083 
> system.time(m <- foreach(i=1:100) %do% 
+ matrix(rnorm(1000*1000), ncol=5000) )
   user  system elapsed 
  9.139   1.879  10.979 

# 2-core
> registerDoMC(cores=2)
> system.time(m <- foreach(i=1:100) %dopar% 
+ matrix(rnorm(1000*1000), ncol=5000) )
   user  system elapsed 
  3.322   3.737 132.027
> system.time(m <- foreach(i=1:100) %do% 
+ matrix(rnorm(1000*1000), ncol=5000) )
   user  system elapsed 
  9.744   2.054  11.740 

Using 4 cores, a few trials yield very different outcomes:

> registerDoMC(cores=4)
> system.time(m <- foreach(i=1:100) %dopar% 
{ matrix(rnorm(1000*1000), ncol=5000) } )
   user  system elapsed 
 11.522   4.082  24.444 
> system.time(m <- foreach(i=1:100) %dopar% 
{ matrix(rnorm(1000*1000), ncol=5000) } )
   user  system elapsed 
 21.388   6.299  25.437 
> system.time(m <- foreach(i=1:100) %dopar% 
{ matrix(rnorm(1000*1000), ncol=5000) } )
   user  system elapsed 
 17.439   5.250   9.300 
> system.time(m <- foreach(i=1:100) %dopar% 
{ matrix(rnorm(1000*1000), ncol=5000) } )
   user  system elapsed 
 17.480   5.264   9.170
1. Producing a single matrix is not (in general) done in parallel unless you define how it is to be done, and you have not. You should expect worse results when using more than one core. 2. Does registerDoMC affect %do%? The results are similar. (Matthew Lundberg)

@MatthewLundberg: Please %do% check the documentation :-) (krlmlr)

It absolutely does make sense. You are asking two cores to update one object, which is programming a series of cache misses. It is much faster to run this code on one core, as you have seen. (Matthew Lundberg)

Maybe you are right, but try to replicate the code using 3, 4, or 6 cores. You will see the time drop. Why is that? (daniel)

@user792000 Multicore programming is a complex topic. One core means no locking, but two or more cores means the same number of locks and the same number of contentions. This is a very contrived problem, and one not suited to a parallel solution. In particular, it is not demonstrating any deficiency in the R parallel code. (This comment should not be construed to imply that there are no defects in the R parallel code.) (Matthew Lundberg)
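
To replicate the comparison suggested in the comments, here is a minimal sketch (assuming doMC is installed) that times the same loop under several core counts:

require(foreach)
require(doMC)

# Time the identical loop under increasing core counts.
for (nc in c(1, 2, 3, 4)) {
  registerDoMC(cores = nc)
  t <- system.time(
    m <- foreach(i = 1:100) %dopar% matrix(rnorm(1000 * 1000), ncol = 5000)
  )
  cat(nc, "cores, elapsed:", t["elapsed"], "seconds\n")
}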

1 Answer


It's the combination of results that eats all the processing time. These are the timings on my machine for the cores=2 scenario when no results are returned. The code is essentially the same; only the created matrices are discarded instead of being returned:

> system.time(m <- foreach(i=1:100) %do% 
+ { matrix(rnorm(1000*1000), ncol=5000); NULL } )
   user  system elapsed 
 13.793   0.376  14.197 
> system.time(m <- foreach(i=1:100) %dopar% 
+ { matrix(rnorm(1000*1000), ncol=5000); NULL } )
   user  system elapsed 
  8.057   5.236   9.970 

Still not optimal, but at least the parallel version is now faster.
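
If only a small summary of each matrix is needed downstream, returning that summary instead of the full matrix keeps the transfer back to the master cheap. A minimal sketch, assuming column means are all that is required:

require(foreach)
require(doMC)
registerDoMC(cores = 2)

# Each full matrix holds 1e6 doubles (~8 MB); its column means are a
# vector of 5000 doubles (~40 KB), so far less data travels back.
system.time(
  m <- foreach(i = 1:100) %dopar% {
    colMeans(matrix(rnorm(1000 * 1000), ncol = 5000))
  }
)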

This is from the documentation of doMC:

The doMC package provides a parallel backend for the foreach/%dopar% function using the multicore functionality of the parallel package.

Now, parallel uses a fork mechanism to spawn identical copies of the R process. Collecting results from separate processes is an expensive task, and this is what you see in your time measurements.
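
One way to get a rough feel for that per-result cost is to serialize a single matrix by hand, which approximates what the fork backend must do to ship each worker's result back through a pipe:

m1 <- matrix(rnorm(1000 * 1000), ncol = 5000)
print(object.size(m1), units = "MB")  # about 8 MB of doubles per result
# serialize() to a raw vector mimics the payload sent to the master
system.time(s <- serialize(m1, connection = NULL))
length(s) / 2^20  # payload size in MiB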