1 vote

I ran some examples and looked at the results. For a large number of iterations the parallel version performs well, but for a small number of iterations it performs worse than the sequential one.

I know there is a little overhead and that's absolutely fine, but is there any way to run a loop with a small number of iterations in parallel so that it beats the sequential version?

x = 0
@time for i=1:200000000
    x = Int(rand(Bool)) + x
end

7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)

x = @time @parallel (+) for i=1:200000000
    Int(rand(Bool))
end

0.432549 seconds (3.91 k allocations: 241.138 KiB)

I get a good result for the parallel version here, but not in the following example.

x2 = 0
@time for i=1:100000
    x2 = Int(rand(Bool)) + x2
end

0.006025 seconds (98.97 k allocations: 1.510 MiB)

x2 = @time @parallel (+) for i=1:100000
    Int(rand(Bool))
end

0.084736 seconds (3.87 k allocations: 239.122 KiB)

I guess the overhead of using threads is only worthwhile after a certain number of iterations. - Maurice Perry
@MauricePerry This isn't threads, it's multiprocessing. Multiprocessing has a lot more overhead than threads, since it's fully asynchronous and can even have processes on other computers. @ReD you need to have "enough" work on each process for multiprocessing to pay off. Otherwise you should look at using threads via Threads.@threads. - Chris Rackauckas
Would you mind some terminology first? This is not [Parallel] process execution but just [Concurrent] scheduling. As defined, [Parallel] processing is, in sharp contrast to merely [Concurrent] processing, guaranteed to start / perform / finish all thread-level and/or instruction-level tasks in a parallel fashion, with a guaranteed finish of the simultaneously executed code paths. For truly [Parallel] execution there would have to be 200E+6 CPU cores. Not here: the @parallel macro never creates CPUs. - user3666197
@ReD For details on (SEQ: setup overheads, PAR: HPC payload) accelerations, do not hesitate to view a post about an (overhead-aware) Amdahl's Law, plus data on the point of diminishing returns, as depicted in the achievable speedups of a similarly motivated problem at >>> stackoverflow.com/a/45562881 - user3666197
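
To make the Threads.@threads suggestion from the comments concrete, here is a minimal sketch of a thread-based version of the same reduction. This sketch is not from any of the posts; it assumes Julia is started with JULIA_NUM_THREADS=4 (or similar), and it gives each thread its own MersenneTwister because the default global RNG is not safe to share across threads:

function threaded_sum(n)
    nt = Threads.nthreads()
    # one RNG per thread: the shared default RNG is not thread-safe
    rngs = [MersenneTwister(i) for i in 1:nt]
    partial = zeros(Int, nt)        # one accumulator per thread
    Threads.@threads for i = 1:n
        t = Threads.threadid()
        partial[t] += Int(rand(rngs[t], Bool))
    end
    return sum(partial)             # combine the per-thread partial sums
end

threaded_sum(100000)    # spawning threads costs far less than spawning processes

Because threads share memory within one process, the fixed startup cost is much smaller than for multiprocessing, which is exactly what matters at small iteration counts.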

2 Answers

1 vote

Doing things in parallel will ALWAYS be less efficient in total work, because parallel execution always carries synchronization overhead. The hope is to get the result earlier in wall-clock time than a purely sequential run (one computer, single core).

Your numbers are astonishing, and I found the cause.

First of all, allow Julia to use all cores: start it with julia -p 4 (or call addprocs(4)), then check in the REPL:

julia> nworkers()
4

# original case to get correct relative times
julia> x = 0
julia> @time for i=1:200000000
          x = Int(rand(Bool)) + x
       end

7.864891 seconds (200.00 M allocations: 2.980 GiB, 1.62% gc time)

julia> x = @time @parallel (+) for i=1:200000000
          Int(rand(Bool))
       end
0.350262 seconds (4.08 k allocations: 254.165 KiB)
99991471

# now a correct benchmark
julia> function test()
         x = 0
         for i=1:200000000
           x = Int(rand(Bool)) + x
         end
         return x    # return the sum so the result is actually used
       end
julia> @time test();
0.465478 seconds (4 allocations: 160 bytes)

What happened?

Your first test case uses a global variable x, and that is terribly slow: the loop accesses an untyped global variable 200,000,000 times, which is why @time reports 200.00 M allocations.

In the second test case the global variable x is assigned only once, so the poor performance of globals does not come into play.

In my test case there is no global variable; I used a local variable instead. Local variables are much faster, because the compiler can infer their type and optimize the loop.
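
For completeness, a minimal sketch (mine, not from the answer above) that wraps both the sequential and the parallel variant in functions, so neither side pays the untyped-global penalty:

function seq_sum(n)
    x = 0                   # local variable: the compiler infers x::Int
    for i = 1:n
        x += Int(rand(Bool))
    end
    return x
end

function par_sum(n)
    # distributed reduction: each worker computes a partial sum, (+) combines them
    @parallel (+) for i = 1:n
        Int(rand(Bool))
    end
end

@time seq_sum(200000000)
@time par_sum(200000000)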

0 votes

Q: Is there any way to run a loop with a small number of iterations in parallel so that it beats the sequential version?


A: Yes.

1) Acquire more resources (processors to compute, memory to store), if doing so makes sense for the problem at hand

2) Arrange the workflow smarter: benefit from register-based code, harness the full cache-line size on each first fetch, and deploy re-use where possible (hard work? Yes, it is hard work, but why repeatedly pay 150+ [ns] when, having paid that once, the well-aligned neighbouring cells can be re-used within ~30 [ns] latency costs, if NUMA permits?). A smarter workflow also often means code re-design aimed at increasing the "density"-of-computations in the resulting assembly code and at tweaking the code so as to better bypass the (optimising) superscalar processor's hardware design tricks, which bring no benefit in highly-tuned HPC computing payloads. One concrete tactic is chunking the work so each worker gets one large block; see the sketch after this list.

3) Avoid running into blocking resources & bottlenecks (central singularities such as a host's unique hardware source of randomness, IO devices, et al.)

4) Get familiar with your optimising compiler's internal options and "shortcuts" -- sometimes anti-patterns get generated at the cost of extended run-times

5) Get the maximum from tuning your underlying operating system. Failing that, your optimised code will still wait (and a lot) in the O/S scheduler's queue
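
To make point 2 concrete, here is an illustrative sketch (the name chunked_sum and the chunking scheme are mine, not from the answer) that amortises the per-task setup overhead by sending each worker one large block of iterations instead of many tiny tasks:

function chunked_sum(n)
    nw = nworkers()
    chunk = cld(n, nw)              # ceiling division: iterations per worker
    partials = pmap(1:nw) do w      # exactly one remote call per worker
        lo = (w - 1) * chunk + 1
        hi = min(w * chunk, n)
        s = 0
        for i = lo:hi
            s += Int(rand(Bool))
        end
        s                           # the partial sum travels back to the master
    end
    return sum(partials)
end

chunked_sum(100000)     # the fixed overhead is paid only nworkers() times

This way the communication cost stays constant in the number of workers rather than growing with the iteration count.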