
I am relatively new to Julia and I am having some issues when trying to parallelise. I have tried both the pmap and @parallel approaches and I run into the same issue with both. When I run something like:

addprocs(7)

A0 = zeros(a_size, b_size, c_size)
A  = SharedArray{Float64}(a_size, b_size, c_size)
toler = 1e-3
maxit = 1000
metric1 = Inf   # force at least one iteration
iter1 = 0

while (metric1 > toler) && (iter1 < maxit)
    @inbounds @sync @parallel for i in 1:c_size
        A[:, :, i] = compute_A(fs, A0[:, :, i], i)
    end
    A_new = sdata(A)
    metric1 = maximum(abs.(A_new - A0))
    A0 = copy(A_new)
    iter1 = iter1 + 1
    println("$(iter1)  $(metric1)")
end

where the inputs of the function compute_A are:

  • fs is a DataType defined by me
  • A0 is an array
  • i is the index I'm looping over (dimension c_size)

this seems to be working fine (it also works if I use pmap instead of SharedArrays and the @parallel loop).
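(For reference, the pmap variant is roughly the following sketch; it assumes compute_A and the type of fs are defined on all workers, e.g. via @everywhere, and stacks the returned slices back into a 3D array:)

# rough sketch of the pmap variant of the same update step
slices = pmap(i -> compute_A(fs, A0[:, :, i], i), 1:c_size)  # one a_size-by-b_size slice per i
A_new  = cat(3, slices...)                                   # stack the slices along the third dimension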

However, when I use a wrapper function for this code, like:

function wrap(fs::DataType, toler::Float64, maxit::Int)
    A0 = zeros(a_size, b_size, c_size)
    A  = SharedArray{Float64}(a_size, b_size, c_size)
    metric1 = Inf
    iter1 = 0

    while (metric1 > toler) && (iter1 < maxit)
        @inbounds @sync @parallel for i in 1:c_size
            A[:, :, i] = compute_A(fs, A0[:, :, i], i)
        end
        A_new = sdata(A)
        metric1 = maximum(abs.(A_new - A0))
        A0 = copy(A_new)
        iter1 = iter1 + 1
        println("$(iter1)  $(metric1)")
    end
end

Calling this wrap(fs, 1e-3, 1000) function runs WAY slower than the non-wrapped version (roughly 600 seconds vs 6 seconds). This seems extremely weird and I don't understand what I am doing wrong, but there is definitely something I'm missing, so I was hoping I could get some help here. I am using Julia v0.6.0. Thanks a lot for your time and help.


1 Answer


My guess (without the ability to run the code in question, this really is a guess) is that A0 is not a SharedArray, and when it is defined globally it is effectively defined on all processes, so no communication is required for it during the calculation (did you notice that A0 is a constant in your calculation?).

In the wrapped version it is defined locally in one process and communicated constantly to the other processes, hence the longer running times.

It is better to have maximal locality of data, so define A0 as a SharedArray of zeros, in both the wrapped and unwrapped versions, using:

A0 = SharedArray{Float64,3}(a_size,b_size,c_size,
                            init = S -> S[Base.localindexes(S)] .= 0)

Furthermore, keeping each [:,:,i] slice on a single process would be ideal (which you get when nworkers() divides c_size evenly); a sketch of the rewritten wrapper follows.
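For concreteness, here is a minimal sketch of the wrapper with both A and A0 as SharedArrays (Julia 0.6 syntax; the sizes, compute_A and the convergence bookkeeping are copied from the question, everything else is an assumption on my part):

function wrap(fs::DataType, toler::Float64, maxit::Int)
    A  = SharedArray{Float64}(a_size, b_size, c_size)
    A0 = SharedArray{Float64,3}(a_size, b_size, c_size,
                                init = S -> S[Base.localindexes(S)] .= 0)
    metric1 = Inf                   # force at least one iteration
    iter1 = 0
    while (metric1 > toler) && (iter1 < maxit)
        @inbounds @sync @parallel for i in 1:c_size
            A[:, :, i] = compute_A(fs, A0[:, :, i], i)
        end
        metric1 = maximum(abs.(sdata(A) - sdata(A0)))
        copy!(A0, A)                # update A0 in place, keeping it a SharedArray
        iter1 += 1
        println("$(iter1)  $(metric1)")
    end
    return sdata(A)
end

The copy!(A0, A) keeps A0 bound to the same shared memory on every iteration instead of rebinding it to a freshly allocated local Array.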

NOTE: I'm not sure what kind of editing went on before the code was put into the question, but if A0 truly is a constant zero tensor, there might be better ways to refactor the code. If A0 is some other tensor, then try:

A0 = SharedArray(otherTensor)

A relevant reference is the SharedArray documentation, which also details how to split a SharedArray 3D tensor between processes so that each slice stays within a single process, for much better performance.
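As a rough illustration of that splitting idea (modelled on the SharedArray example in the manual; my_chunk and fill_local! are hypothetical helper names, and splitting along the third dimension is my assumption):

@everywhere function my_chunk(A::SharedArray)
    idx = indexpids(A)              # this process's position among procs(A); 0 if unmapped
    idx == 0 && return 1:0          # no slices assigned to this process
    nchunks = length(procs(A))
    splits = [round(Int, s) for s in linspace(0, size(A, 3), nchunks + 1)]
    (splits[idx] + 1):splits[idx + 1]
end

@everywhere function fill_local!(A, A0, fs)
    for i in my_chunk(A)            # each process touches only its own [:,:,i] slices
        A[:, :, i] = compute_A(fs, A0[:, :, i], i)
    end
end

# driver, replacing the @parallel loop inside the while loop:
@sync for p in procs(A)
    @async remotecall_wait(fill_local!, p, A, A0, fs)
end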