0 votes

It is said that Julia for-loops are as fast as vectorized operations, or even faster when used properly. I have two pieces of code. The idea is to compute a sample statistic for a given 0-1 sequence x (in these two examples I'm just computing a sum, but there are more complicated cases; I'm trying to understand the general performance pitfalls in my code). The first looks like:

S = 2 * sum(x) - n
s_obs_new = abs(S) / sqrt(2 * n)
pval = erfc(s_obs_new)

and the second is something "naive" and classical:

S = 0
for i in eachindex(x)
    S += x[i]
end
S = 2 * S - n
s_obs_new = abs(S) / sqrt(2 * n)
pval = erfc(s_obs_new)

Using @benchmark I've found that the running time of the first example is about 11.8 ms, while the second takes about 38 ms.

This example is very important to me, because there are many other places where vectorization isn't possible, so I want to do computations in a devectorized manner as fast as in the vectorized one.

Are there any ideas why the devectorized code is about 4x slower than the vectorized one? The type stability is OK, and there are no unnecessary large memory allocations.

The code for the first function is:

using SpecialFunctions  # for erfc

function frequency_monobit_test1(x::Array{Int8, 1}, n = 0)
    # count 1's and 0's in the sequence; we want the number
    # of 1's and 0's to be approximately the same
    # recommendation: n >= 100
    # decision rule (at 1% level): if pval < 0.01 -> non-random
    if n == 0
        n = length(x)
    end
    S = 2 * sum(x) - n
    s_obs_new = abs(S) / sqrt(2 * n)
    pval = erfc(s_obs_new)
    return pval
end

The second is:

function frequency_monobit_test2(x::Array{Int8, 1}, n = 0)
    # count 1's and 0's in the sequence; we want the number
    # of 1's and 0's to be approximately the same
    # recommendation: n >= 100
    # decision rule (at 1% level): if pval < 0.01 -> non-random
    if n == 0
        n = length(x)
    end
    S = 0
    @inbounds for i in eachindex(x)
        S += x[i]
    end
    S = 2 * S - n
    s_obs_new = abs(S) / sqrt(2 * n)
    pval = erfc(s_obs_new)
    return pval
end
Comments:

First, the official Performance Tips are important to read. – Dan Getz
Another thing to try is to put @inbounds @simd in front of the for statement. – Dan Getz
Please show us the actual code you run to get the benchmarks. – David P. Sanders
Can you provide some input to test the code on? How can one generate x? Also, if you want to time the code properly, you should use github.com/JuliaCI/BenchmarkTools.jl; don't just use @time in global scope. – DNF
The only difference between your two code snippets is that you call sum in the first, and manually accumulate in a loop in the second. They should be equally fast, if you use @inbounds. Also, make sure that you initialize S to the correct type. Is x integer or float? It is best to initialize with S = zero(eltype(x)) instead of S = 0. And use BenchmarkTools.jl, as I mentioned. – DNF
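Putting the comments' suggestions together, a minimal benchmark sketch might look like this (the names x and loop_sum are illustrative, not from the original post):

using BenchmarkTools

x = rand(0:1, 2^25)                 # a sample 0-1 sequence of Int64s

function loop_sum(x)
    S = zero(eltype(x))             # initialize with the element type, per DNF's comment
    @inbounds @simd for i in eachindex(x)
        S += x[i]                   # @inbounds @simd, per Dan Getz's comment
    end
    return S
end

@btime sum($x)                      # vectorized
@btime loop_sum($x)                 # devectorized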

1 Answer

0 votes

This is a curious case. There seems to be a performance problem when accumulating Int8s in an Int64 variable.

Let's try these functions:

using SpecialFunctions, BenchmarkTools, Random  # Random provides bitrand

function frequency_monobit_test1(x, n=length(x))
    S = sum(x)
    return erfc(abs(2S - n) / sqrt(2n))
end

function frequency_monobit_test3(typ::Type{<:Integer}, x, n=length(x))
    S = zero(typ)
    for i in eachindex(x)
        @inbounds S += x[i]
    end
    return erfc(abs(2S - n) / sqrt(2n))
end

Initializing some vectors

N = 2^25;
x64 = rand(0:1, N);
x32 = rand(Int32[0, 1], N);
x8 = rand(Int8[0, 1], N);
xB = rand(Bool, N);
xb = bitrand(N);

Benchmarking:

For Int64 input:

julia> @btime frequency_monobit_test1($x64)
  17.540 ms (0 allocations: 0 bytes)
0.10302739916042186

julia> @btime frequency_monobit_test3(Int64, $x64)
  17.796 ms (0 allocations: 0 bytes)
0.10302739916042186

julia> @btime frequency_monobit_test3(Int32, $x64)
  892.715 ms (67106751 allocations: 1023.97 MiB)
0.10302739916042186

We see that sum and an explicit loop are equally fast, and that accumulating into an Int32 when the elements are Int64 is a bad idea.
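The reason for the Int32 disaster is a type instability rather than anything deeper: Int32 + Int64 promotes to Int64, so S starts out as an Int32 and becomes an Int64 after the first iteration, forcing the compiler to box it on every pass (hence the roughly 2N allocations above). You can check the promotion directly, and @code_warntype on the call should show S inferred as a Union:

julia> typeof(zero(Int32) + zero(Int64))  # mixed-width addition promotes to Int64
Int64

julia> @code_warntype frequency_monobit_test3(Int32, x64)  # look for S::Union{Int32, Int64}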

For Int32 input:

julia> @btime frequency_monobit_test1($x32)
  9.137 ms (0 allocations: 0 bytes)
0.2386386867682374

julia> @btime frequency_monobit_test3(Int64, $x32)
  8.839 ms (0 allocations: 0 bytes)
0.2386386867682374

julia> @btime frequency_monobit_test3(Int32, $x32)
  7.274 ms (0 allocations: 0 bytes)
0.2386386867682374

sum and loop are similar in speed. Accumulating into Int32 saves a bit of time.

For Int8 input:

julia> @btime frequency_monobit_test1($x8)
  5.681 ms (0 allocations: 0 bytes)
0.16482999123032094

julia> @btime frequency_monobit_test3(Int64, $x8)
  19.517 ms (0 allocations: 0 bytes)
0.16482999123032094

julia> @btime frequency_monobit_test3(Int32, $x8)
  4.815 ms (0 allocations: 0 bytes)
0.16482999123032094

The explicit loop is slightly faster when accumulating into an Int32, but holy cow, what's going on with Int64? That is super slow!
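One plausible explanation, though I haven't verified it, is that LLVM fails to SIMD-vectorize the scalar loop when every Int8 element must be widened to Int64 before the add, while sum uses a chunked, pairwise reduction that vectorizes fine. Following Dan Getz's comment, a variant worth benchmarking adds @simd, which licenses the reassociation the vectorizer needs:

function frequency_monobit_test4(typ::Type{<:Integer}, x, n=length(x))
    S = zero(typ)
    @inbounds @simd for i in eachindex(x)
        S += x[i]                   # widening add; @simd may let LLVM vectorize it
    end
    return erfc(abs(2S - n) / sqrt(2n))
end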

What about Bool?

julia> @btime frequency_monobit_test1($xB)
  9.627 ms (0 allocations: 0 bytes)
0.7728544347518309

julia> @btime frequency_monobit_test3(Int64, $xB)
  9.629 ms (0 allocations: 0 bytes)
0.7728544347518309

julia> @btime frequency_monobit_test3(Int32, $xB)
  4.815 ms (0 allocations: 0 bytes)
0.7728544347518309

Loop and sum have the same speed, but accumulating into an Int32 saves half the time.

Now we'll try a BitArray:

julia> @btime frequency_monobit_test1($xb)
  259.044 μs (0 allocations: 0 bytes)
0.7002576522570715

julia> @btime frequency_monobit_test3(Int64, $xb)
  19.423 ms (0 allocations: 0 bytes)
0.7002576522570715

julia> @btime frequency_monobit_test3(Int32, $xb)
  19.430 ms (0 allocations: 0 bytes)
0.7002576522570715

So, sum on a BitArray is crazy fast, because the addition can be performed in 64-bit chunks, but extracting single elements in a loop has some overhead.
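For context, this relies on an implementation detail of BitArray (as far as I know it): the bits live in a Vector{UInt64} field called chunks, and unused trailing bits are kept zeroed, so counting the ones reduces to a hardware popcount per 64-bit word:

# xb.chunks is the Vector{UInt64} backing the BitArray; a popcount
# over the chunks gives the same result as counting the elements.
sum(count_ones, xb.chunks) == count(xb)   # true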

Conclusions:

  • You can get similar or better performance with a loop than with sum, except for BitArrays, which are a very special case.
  • If you know the length of your array, and know that Int32 is enough to hold your sum, then that's a time saver.
  • There is something very strange going on when accumulating Int8s into an Int64. I don't know why the performance is so bad.
  • If you are only interested in zeros and ones, use an array of Bools, not of Int8s, and possibly accumulate into an Int32 (see the sketch after this list).
  • BitArrays can be super fast in certain cases.
  • sum on a vector of Int8s is strangely fast, twice as fast as sum(::Vector{Bool}).
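Putting those recommendations together, a version along these lines is what I would reach for (my sketch, not code from the question; it assumes length(x) fits in an Int32):

using SpecialFunctions

function frequency_monobit_test(x::AbstractVector{Bool}, n = length(x))
    S = zero(Int32)                 # safe as long as length(x) <= typemax(Int32)
    @inbounds for i in eachindex(x)
        S += x[i]                   # Bool promotes to the Int32 accumulator
    end
    return erfc(abs(2S - n) / sqrt(2n))
end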