1 vote

I ran some examples and looked at the results. For a large number of iterations the parallel version performs well, but for a small number of iterations it performs worse than the sequential one.

I know there is a little overhead and that's absolutely fine, but is there any way to run a loop with a small number of iterations in parallel so that it beats the sequential version?

x = 0
@time for i=1:200000000
    x = Int(rand(Bool)) + x
end

7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)

x = @time @parallel (+) for i=1:200000000
    Int(rand(Bool))
end

0.432549 seconds (3.91 k allocations: 241.138 KiB)

I get a good result for the parallel version here, but not in the following example.

x2 = 0
@time for i=1:100000
    x2 = Int(rand(Bool)) + x2
end

0.006025 seconds (98.97 k allocations: 1.510 MiB)

x2 = @time @parallel (+) for i=1:100000
    Int(rand(Bool))
end

0.084736 seconds (3.87 k allocations: 239.122 KiB)

I guess the overhead of using threads is only worthwhile after a certain number of iterations. - Maurice Perry
@MauricePerry This isn't threads, it's multiprocessing. Multiprocessing has a lot more overhead than threads, since it's fully asynchronous and can even have processes on other computers. @ReD you need to have "enough" work on each process for multiprocessing to pay off. Otherwise you should look at using threads via Threads.@threads. - Chris Rackauckas
Would you mind some terminology first? This is not [Parallel] process execution but just [Concurrent] scheduling. As defined, [Parallel] processing is, in sharp contrast to merely [Concurrent] processing, guaranteed to start / perform / finish all thread-level and/or instruction-level tasks in a parallel fashion, with a guaranteed finish of the simultaneously executed code paths. For truly [Parallel] execution there would have to be 200E+6 CPU cores. Not here: the @parallel macro never creates CPUs. - user3666197
@ReD For details on (SEQ: setup overheads, PAR: HPC payload) accelerations, do not hesitate to view a post about an (overhead-aware) Amdahl's Law, plus data on the point of diminishing returns, as depicted in the achievable speedups of a similarly motivated problem at >>> stackoverflow.com/a/45562881 - user3666197
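
To make the Threads.@threads suggestion from the comments concrete, here is a minimal sketch of a thread-based version of the same reduction. This sketch is not from any of the posts; it assumes Julia is started with JULIA_NUM_THREADS=4 (or similar), and it gives each thread its own MersenneTwister because the default global RNG is not safe to share across threads:

function threaded_sum(n)
    nt = Threads.nthreads()
    # one RNG per thread: the shared default RNG is not thread-safe
    rngs = [MersenneTwister(i) for i in 1:nt]
    partial = zeros(Int, nt)        # one accumulator per thread
    Threads.@threads for i = 1:n
        t = Threads.threadid()
        partial[t] += Int(rand(rngs[t], Bool))
    end
    return sum(partial)             # combine the per-thread partial sums
end

threaded_sum(100000)    # spawning threads costs far less than spawning processes

Because threads share memory within one process, the fixed startup cost is much smaller than for multiprocessing, which is exactly what matters at small iteration counts.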

2 Answers

1 vote

Doing things in parallel will ALWAYS be less efficient in total work, because parallel execution always carries synchronization overhead. The hope is to get the result earlier in wall-clock time than a purely sequential run (one computer, single core).

Your numbers are astonishing, and I found the cause.

First of all, allow Julia to use all cores: start it with julia -p 4 (or call addprocs(4)), then check in the REPL:

julia> nworkers()
4

# original case to get correct relative times
julia> x = 0
julia> @time for i=1:200000000
          x = Int(rand(Bool)) + x
       end

7.864891 seconds (200.00 M allocations: 2.980 GiB, 1.62% gc time)

julia> x = @time @parallel (+) for i=1:200000000
          Int(rand(Bool))
       end
0.350262 seconds (4.08 k allocations: 254.165 KiB)
99991471

# now a correct benchmark
julia> function test()
         x = 0
         for i=1:200000000
           x = Int(rand(Bool)) + x
         end
         return x    # return the sum so the result is actually used
       end
julia> @time test();
0.465478 seconds (4 allocations: 160 bytes)

What happened?

Your first test case uses a global variable x, and that is terribly slow: the loop accesses an untyped global variable 200,000,000 times, which is why @time reports 200.00 M allocations.

In the second test case the global variable x is assigned only once, so the poor performance of globals does not come into play.

In my test case there is no global variable; I used a local variable instead. Local variables are much faster, because the compiler can infer their type and optimize the loop.
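
For completeness, a minimal sketch (mine, not from the answer above) that wraps both the sequential and the parallel variant in functions, so neither side pays the untyped-global penalty:

function seq_sum(n)
    x = 0                   # local variable: the compiler infers x::Int
    for i = 1:n
        x += Int(rand(Bool))
    end
    return x
end

function par_sum(n)
    # distributed reduction: each worker computes a partial sum, (+) combines them
    @parallel (+) for i = 1:n
        Int(rand(Bool))
    end
end

@time seq_sum(200000000)
@time par_sum(200000000)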

0 votes

Q: Is there any way to run a loop with a small number of iterations in parallel so that it beats the sequential version?


A: Yes.

1) Acquire more resources (processors to compute, memory to store), if doing so makes sense for the problem at hand

2) Arrange the workflow smarter: benefit from register-based code, harness the full cache-line size on each first fetch, and deploy re-use where possible (hard work? Yes, it is hard work, but why repeatedly pay 150+ [ns] when, having paid that once, the well-aligned neighbouring cells can be re-used within ~30 [ns] latency costs, if NUMA permits?). A smarter workflow also often means code re-design aimed at increasing the "density"-of-computations in the resulting assembly code and at tweaking the code so as to better bypass the (optimising) superscalar processor's hardware design tricks, which bring no benefit in highly-tuned HPC computing payloads. One concrete tactic is chunking the work so each worker gets one large block; see the sketch after this list.

3) Avoid running into blocking resources & bottlenecks (central singularities such as a host's unique hardware source of randomness, IO devices, et al.)

4) Get familiar with your optimising compiler's internal options and "shortcuts" -- sometimes anti-patterns get generated at the cost of extended run-times

5) Get the maximum from tuning your underlying operating system. Failing that, your optimised code will still wait (and a lot) in the O/S scheduler's queue
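
To make point 2 concrete, here is an illustrative sketch (the name chunked_sum and the chunking scheme are mine, not from the answer) that amortises the per-task setup overhead by sending each worker one large block of iterations instead of many tiny tasks:

function chunked_sum(n)
    nw = nworkers()
    chunk = cld(n, nw)              # ceiling division: iterations per worker
    partials = pmap(1:nw) do w      # exactly one remote call per worker
        lo = (w - 1) * chunk + 1
        hi = min(w * chunk, n)
        s = 0
        for i = lo:hi
            s += Int(rand(Bool))
        end
        s                           # the partial sum travels back to the master
    end
    return sum(partials)
end

chunked_sum(100000)     # the fixed overhead is paid only nworkers() times

This way the communication cost stays constant in the number of workers rather than growing with the iteration count.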