10 votes

I know that questions about multi-threading performance in Julia have already been asked (e.g. here), but they involve fairly complex code in which many things could be at play.

Here, I am running a very simple loop on multiple threads with Julia v1.5.3, and the speedup doesn't scale well compared to running the same loop in, for instance, Chapel.

I would like to know what I am doing wrong and how I could run multi-threading in Julia more efficiently.

Sequential code

using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

Time: 8.034s

Shared memory parallelism with Threads.@threads on 4 threads

using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total[])
end

@btime slow(Int64(1e8), "9")

Time: 6.938s
Speedup: 1.2

Shared memory parallelism with FLoops on 4 threads

using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

Time: 10.850s
No speedup: slower than the sequential code.

Tests on various numbers of threads (different hardware)

I tested the sequential and Threads.@threads code on a different machine and experimented with various numbers of threads.
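(Note: in Julia the number of threads is fixed at startup, set with the JULIA_NUM_THREADS environment variable or, since Julia 1.5, the --threads command-line flag; it can be checked from within the session:)

using Base.Threads
nthreads()  # number of threads the current Julia session was started with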

Here are the results:

Number of threads    Speedup
2                    1.2
4                    1.2
8                    1.0 (no speedup)
16                   0.9 (the code takes longer than the sequential code)

For a heavier computation (n = 1e9 in the code above), which should minimize the relative effect of any overhead, the results are very similar:

Number of threads    Speedup
2                    1.1
4                    1.3
8                    1.1
16                   0.8 (the code takes longer than the sequential code)

For comparison: same loop with Chapel showing perfect scaling

Code run with Chapel v1.23.0:

use Time;
var watch: Timer;
config const n = 1e8: int;
config const digits = "9";
var total = 0.0;
watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
}
watch.stop();
writef("total = %{###.###############} in %{##.##} seconds\n",
        total, watch.elapsed());

First run (same hardware as the first Julia tests):

Number of threads    Time (s)    Speedup
1                    13.33       n/a
2                    7.34        1.8

Second run (same hardware):

Number of threads    Time (s)    Speedup
1                    13.59       n/a
2                    6.83        2.0

Third run (different hardware):

Number of threads    Time (s)    Speedup
1                    19.99       n/a
2                    10.06       2.0
4                    5.05        4.0
8                    2.54        7.9
16                   1.28        15.6
Comments:

Also, is the equivalent Chapel code thread-safe? (I don't know Chapel well) – Oscar Smith

Yes, it is thread-safe. – prosoitos

Not directly related to the question, but while doing my own timings, I converted the timed computation in Chapel to the more compact/elegant: total = + reduce [i in 1..n] if (i: string).find(digits) == -1 then 1.0 / i else 0.0; – Brad

Thank you Brad for the neat compact version! I will leave my less elegant version in the answer as it may be easier to read for non-Chapel experts. – prosoitos

2 Answers

4 votes

Someone else can give a much more detailed analysis, but the main reason this naive threaded version performs badly is that the work done in each iteration is far too light. Using an atomic accumulator here implies huge overhead, because the threads spend most of their time contending for the same atomic variable.
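One standard way to avoid that contention, shown here only as a minimal sketch (it is not code from the question), is to give each thread its own accumulator and combine the partial sums once at the end:

using Base.Threads

function slow_partials(n::Int, digits::String)
    # one slot per thread; each thread only ever writes to its own slot
    # (this relies on the static schedule used by @threads)
    partials = zeros(Float64, nthreads())
    @threads for i in 1:n
        if !occursin(digits, string(i))
            @inbounds partials[threadid()] += 1.0 / i
        end
    end
    "total = $(sum(partials))"
end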

Since your Chapel code is doing a mapreduce, we can also try a parallel mapreduce in Julia:


julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"

julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+,1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"

These timings are with 4 threads. I've tested other numbers of threads and confirmed that the scaling is close to linear.

By the way, as a general tip: try to avoid printing inside benchmarked code. It clutters the output when the benchmark runs the code repeatedly, and if the task itself is fast, the I/O can take a non-negligible share of the measured time.
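A small example of the tip: return the value instead of printing it, and interpolate the arguments into @btime with $ so that the lookup of non-constant globals is not part of the measurement:

n = Int64(1e5)
digits = "9"
@btime slow_thread_thx($n, $digits)  # $-interpolation keeps global lookups out of the timing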

3 votes

As jling suggests in the comments on their answer, the problem here is most likely that the Julia code allocates lots of memory that then needs to be garbage collected. Chapel is, to my understanding, not a garbage-collected language, which could explain why this example scales more linearly there. As a small test of this hypothesis, I benchmarked the following code, which performs the same operations but with a preallocated Vector{UInt8} instead of a String:

using BenchmarkTools
using Transducers
using Distributed

# Write the decimal digits of x into the first ndigits(x) bytes of a as
# ASCII characters, reusing the buffer instead of allocating a String.
function string_vector!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)  # split off the last decimal digit
        a[i] = 0x30 + r         # 0x30 is ASCII '0'
        x = oftype(x, d)
        i -= 1
    end
    a
end

function slow_no_garbage(n::UInt, digits::String)
    digits = collect(codeunits(digits))
    # one reusable digit buffer per thread, so no strings are allocated
    thread_strings = [zeros(UInt8, 100) for _ in 1:Threads.nthreads()]
    # Base._searchindex is an internal Base helper that searches one byte
    # vector inside another without allocating; it returns 0 when not found
    fun(i) = if Base._searchindex(string_vector!(thread_strings[Threads.threadid()], i), digits, 1) == 0
            1.0 / i
        else
            0.0
        end
    total = foldxt(+, Map(fun), 0x1:n)  # foldxt: thread-parallel fold from Transducers.jl
    "total = $total"
end
println(@btime slow_no_garbage(UInt(1e8), "9"))

I do not recommend using this code as is. In particular, I never properly clear the thread buffers between iterations; I get away with it only because the numbers always grow in length (so the unwritten tail of each buffer stays zeroed), but that is easily fixed. Still, it results in almost linear scaling with the number of threads (see the table at the end of this answer).
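For completeness, here is a sketch of that easy fix (my addition, not part of the benchmarked code): zero the unused tail of the buffer after writing, so the search can never hit stale digits from a previous, longer number:

# Hypothetical variant of string_vector! that clears the unused tail.
function string_vector_cleared!(a::Vector{UInt8}, x::Unsigned)
    string_vector!(a, x)
    fill!(view(a, ndigits(x)+1:lastindex(a)), 0x00)
    return a
end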

As jling also mentioned, when a lot of garbage is created, distributing the work across processes may be better than threading, since each process then runs its own garbage collector. The following two code snippets use Transducers.jl to run the computation, first using threads:

using BenchmarkTools
using Transducers

function slow_transducers(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
            1.0 / i
        else
            0.0
        end
    total = foldxt(+, Map(fun), 1:n)
    "total = $total"
end

println(@btime slow_transducers(Int64(1e8), "9"))
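As with the threaded versions above, foldxt uses Julia's threads, so the thread count is fixed when Julia starts (e.g. via JULIA_NUM_THREADS).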

and then distributed to separate Julia processes (taking the number of processes as the first command-line argument):

using BenchmarkTools
using Transducers
using Distributed

function slow_distributed(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
            1.0 / i
        else
            0.0
        end
    total = foldxd(+, Map(fun), 1:n)
   "total = $total"
end

addprocs(parse(Int, ARGS[1]))
println(@btime slow_distributed(Int64(1e8), "9"))
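This script would be launched as, for example, julia slow_distributed.jl 4 (the file name is illustrative), so that ARGS[1] sets the number of worker processes.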

The following table shows the results of running all versions with different numbers of threads/processes:

Threads/processes    jling      slow_transducers    slow_distributed    slow_no_garbage    Chapel
1                    4.242 s    4.224 s             4.241 s             2.743 s            7.32 s
2                    2.952 s    2.958 s             2.168 s             1.447 s            3.73 s
4                    2.135 s    2.147 s             1.163 s             0.716105 s         1.9 s
8                    1.742 s    1.741 s             0.859058 s          0.360469 s

Speedup:

Threads/processes    jling      slow_transducers    slow_distributed    slow_no_garbage    Chapel
1                    1.0        1.0                 1.0                 1.0                1.0
2                    1.43699    1.42799             1.95618             1.89565            1.96247
4                    1.98689    1.9674              3.6466              3.83044            3.85263
8                    2.43513    2.42619             4.9368              7.60953