It is said that Julia for-loops are as fast as vectorized operations, or even faster when used properly. I have two pieces of code. The idea is to compute a sample statistic for a given 0-1 sequence x (in these two examples I'm just computing a sum, but there are more complicated cases; I'm trying to understand the general performance pitfalls in my code). The first looks like:
S = 2 * sum(x) - n
s_obs_new = abs(S) / sqrt(2 * n)
pval = erfc(s_obs_new)
and the second is something "naive" and classical:
S = 0
for i in eachindex(x)
S += x[i]
end
S = 2 * S - n
s_obs_new = abs(S) / sqrt(2 * n)
pval = erfc(s_obs_new)
Using @benchmark I've found that the running time of the first example is about 11.8 ms, and of the second about 38 ms.
This example is very important to me, because there are many other places where vectorization isn't possible, so I want devectorized computations to run as fast as vectorized ones.
Any ideas why the devectorized code is roughly 4x slower than the vectorized one? Type stability is OK, and there are no unnecessary big memory allocations, etc.
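This is roughly how I checked those claims (a sketch; frequency_monobit_test2 is the loop version defined below, and x stands for a hypothetical random 0-1 test vector):
x = rand(Int8[0, 1], 10^7)                 # hypothetical 0-1 test sequence
@code_warntype frequency_monobit_test2(x)  # no Union/Any in the output, so it looks type stable
frequency_monobit_test2(x)                 # run once so compilation is not counted
@allocated frequency_monobit_test2(x)      # reports (close to) zero bytes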
The code for the first function is:
using SpecialFunctions  # provides erfc on Julia >= 0.7

function frequency_monobit_test1(x::Array{Int8, 1}, n = 0)
    # count the 1s and 0s in the sequence; we want the number
    # of 1s and 0s to be approximately the same
    # recommendation: n >= 100
    # decision rule (at the 1% level): if pval < 0.01 -> non-random
    if n == 0
        n = length(x)
    end
    S = 2 * sum(x) - n
    s_obs_new = abs(S) / sqrt(2 * n)
    pval = erfc(s_obs_new)
    return pval
end
The second is:
function frequency_monobit_test2(x::Array{Int8, 1}, n = 0)
    # count the 1s and 0s in the sequence; we want the number
    # of 1s and 0s to be approximately the same
    # recommendation: n >= 100
    # decision rule (at the 1% level): if pval < 0.01 -> non-random
    if n == 0
        n = length(x)
    end
    S = 0
    @inbounds for i in eachindex(x)
        S += x[i]
    end
    S = 2 * S - n
    s_obs_new = abs(S) / sqrt(2 * n)
    pval = erfc(s_obs_new)
    return pval
end
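For completeness, this is roughly how I benchmark the two versions (a sketch; the real x comes from elsewhere, here it is just a hypothetical random 0-1 vector):
using BenchmarkTools

x = rand(Int8[0, 1], 10^7)              # hypothetical 0-1 test sequence
@benchmark frequency_monobit_test1($x)  # $x interpolates the global so it is not part of the timing
@benchmark frequency_monobit_test2($x)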
Comments:

Try @inbounds @simd in front of the for statement. – Dan Getz

What is x? Also, if you want to time the code properly, you should use github.com/JuliaCI/BenchmarkTools.jl, don't just use @time in global scope. – DNF

You use sum in the first, and manually accumulate in a loop in the second. They should be equally fast, if you use @inbounds. Also, make sure that you initialize S to the correct type. Is x integer or float? Best to initialize with S = zero(eltype(x)) instead of S = 0. And use BenchmarkTools.jl, as I mentioned. – DNF
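Putting the comments together, here is a sketch of the loop version with @simd added (frequency_monobit_test3 is just my name for this variant). One deviation from the literal suggestion: I keep the accumulator as a plain Int, which is also what sum(x) returns for an Int8 vector, because a zero(eltype(x)) accumulator of type Int8 would overflow on long sequences.
using SpecialFunctions  # erfc moved out of Base in newer Julia versions

function frequency_monobit_test3(x::Array{Int8, 1}, n = length(x))
    S = 0                                   # accumulate in Int, like sum(x) does
    @inbounds @simd for i in eachindex(x)
        S += x[i]
    end
    S = 2 * S - n
    s_obs_new = abs(S) / sqrt(2 * n)
    return erfc(s_obs_new)
end
Using n = length(x) as the default argument also replaces the n == 0 sentinel from the originals.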