
I have some performance problems with parallel computing in Julia. I am new to both Julia and parallel computing.

In order to learn, I parallelized a piece of code that should benefit from parallelization, but it does not.

The program estimates the mean of the means of arrays whose elements are drawn at random from a uniform distribution.

Serial version

tic()
function mean_estimate(N::Int)   # N is unused
   iter = 100000*2               # number of random arrays to average
   p = 5000                      # length of each random array
   vec_mean = zeros(iter)
   for i = 1:iter
      vec_mean[i] = mean(rand(p))
   end
   return mean(vec_mean)
end

a = mean_estimate(0)
toc()

println("The mean is: ", a)

Parallelized version

addprocs(CPU_CORES - 1)          # add one worker per remaining core
println("CPU cores ", CPU_CORES)

tic()
@everywhere function mean_estimate(N::Int)   # N is unused
   iter = 100000                 # half the serial count, since two workers run it
   p = 5000                      # length of each random array
   vec_mean = zeros(iter)
   for i = 1:iter
      vec_mean[i] = mean(rand(p))
   end
   return mean(vec_mean)
end

the_mean = mean(vcat(pmap(mean_estimate, [1, 2])...))
toc()

println("The mean is: ", the_mean)

Notes:

  • The factor 2 in iter = 100000*2 in the serial code is there because I tried the code on a PC with two cores.
  • I checked the usage of the two cores with htop, and both appear busy.

The outputs I get are:

me@pentium-ws:~/average$ time julia serial.jl 
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215

real    0m2.961s
user    0m2.928s
sys     0m0.116s

and

me@pentium-ws:~/average$ time julia -p 2 parallel.jl 
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994

real    0m7.576s
user    0m11.744s
sys     0m0.308s

I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, there is a large difference in the total execution time.

Questions

  • Why is the parallelized version slower? (What am I doing wrong?)
  • What is the right way to parallelize this program?

Note: I use pmap with vcat because I wish to try the same approach with the median too.

Thanks for your help


EDIT

I measured times as @HighPerformanceMark suggested. The tic()/toc() times are the following; the iteration count is 2E6 in every case.

Array size   Single thread (s)   Parallel (s)   Ratio
      5000                2.69           2.89     1.07
    100000              488.77         346.00     0.71
   1000000             4776.58        4438.09     0.93

I am puzzled about why there is no clear trend with array size.

Thanks for the reply @HighPerformanceMark; I'm going to run the test you suggest and comment with the results. (user1420303)

The result of pmap is not strictly typed. You should have a type conversion or assertion. (Chris Rackauckas)

You are measuring a lot of different things here, including compilation and startup time. Check out docs.julialang.org/en/release-0.5/manual/performance-tips/… for some tips on benchmarking Julia code. (Alexander Morley)

@HighPerformanceMark, I edited the question with results for larger input arrays. (user1420303)

Thanks @ChrisRackauckas. I used @time, running the function twice in the same script to follow what is recommended in docs.julialang.org/en/release-0.5/manual/performance-tips/…; the code and output are at pastebin.com/X1HEQF34. I also tried to measure the vcat step; that code and output are at pastebin.com/VjxEHcbW. It seems to me that the vcat step is fast simply because it takes the mean of only two values. (user1420303)

1 Answer


You should pay close attention to the suggestions in the comments.

As @ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, make sure that your functions are type-stable. Consider asserting the return type of the call involving pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids)))::Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler; the pmap source code is a good springboard (see the code here).
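For concreteness, here is a minimal sketch of such a wrapper; the name estimate_mean is mine, not from the question:

# assert the result type so downstream code sees a concrete Float64,
# since pmap returns a loosely typed Vector{Any}
estimate_mean(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids)...))::Float64

the_mean = estimate_mean([1, 2])   # mirrors the question's call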

Furthermore, as @AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally the performance of a function f() is measured in Julia by running it twice and timing only the second run. On the first run, the JIT compiler compiles f() before running it, while the second run uses the already compiled function. Compilation incurs an (unwanted) performance cost, so timing the second run avoids measuring the compilation.
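Applied to the question's mean_estimate, the pattern looks like this:

mean_estimate(0)         # first call: includes JIT compilation
@time mean_estimate(0)   # second call: measures compute time only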

If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:

# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
    vec_mean = zeros(iter)
    for i in eachindex(vec_mean)
        vec_mean[i] = f(p)
    end
    return mean(vec_mean)
end

# run twice, time on second run to get compute time
g(200000, 5000)
@time g(200000, 5000)

### output on my machine
#  2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917

The @time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
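One way to cut those allocations, sketched here as an illustration (the name g_prealloc and the running-total formulation are mine): fill a single preallocated buffer with rand! instead of allocating a fresh rand(p) on every iteration.

# reuse one buffer across iterations; rand! overwrites it in place
function g_prealloc(iter::Int, p::Int)
    buf = zeros(p)        # allocated once, not iter times
    total = 0.0
    for i in 1:iter
        rand!(buf)        # fresh uniform draws, no new allocation
        total += mean(buf)
    end
    return total / iter   # same mean of means as before
end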

Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with

x = rand(5_000_000)
mean(x)       # first call compiles mean for this type
@time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)

so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.

One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.
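For a taste of the SharedArray approach, a minimal sketch against the Julia 0.5-era API (not a drop-in replacement for your code):

S = SharedArray(Float64, 5_000_000)   # one array visible to all workers
@sync @parallel for i in 1:length(S)
    S[i] = rand()                     # each worker fills part of the array
end
mean(S)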