3
votes

In a C++ course, I was taught tricks to improve performance, such as avoiding repeated computation, using additions instead of multiplications, and avoiding powers. When I tried them to optimize code in Julia-Lang, however, I was surprised to get the opposite result.

For example, here are a few equations without mathematical optimization (all code written in Julia 1.1, not JuliaPro):

function OriginalFunction( a,b,c,d,E )
    # Operations' count:
    # sqrt: 4
    # ^: 14
    # * : 14
    # / : 10
    # +: 20
    # -: 6
    # = : 0+4
    x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
    y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)

    x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
    y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)

    return [ [x1;y1] [x2;y2] ]
end

I optimized them with a few tricks, including:

  1. (a*b + a*c) -> a*(b+c) because addition is faster than multiplication.
  2. a^2 -> a*a to avoid power operation.
  3. If there is a long operation used at least twice, assign it to a variable to avoid repeated computation. For example:
x = a * (1+c^2); y = b * (1+c^2)
->
temp = 1+c^2
x = a * temp; y = b * temp
  4. Convert Int literals to Float64, so that the computer doesn't have to do the conversion (at run time or compile time). For example (a combined toy example of all four tricks follows this list):

1/x -> 1.0/x
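
Taken together, the tricks look like this on a tiny made-up function (foo is hypothetical, written only for illustration, not taken from my real code):

# Before: a repeated subexpression, a power, and Int literals
foo(a, c) = a/(1 + c^2) + 1/(1 + c^2)

# After applying tricks 1-4 by hand
function foo_optimized(a, c)
    temp = 1.0/(1.0 + c*c)   # trick 2 (c*c), trick 3 (hoisting), trick 4 (1.0)
    return (a + 1.0)*temp    # trick 1: factoring turns two divisions into one divide and one multiply
end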

Applying these tricks to OriginalFunction gives equivalent equations with far fewer operations:

function SimplifiedFunction( a,b,c,d,E )
    # Operations' count:
    # sqrt: 1
    # ^: 0
    # *: 9
    # /: 1
    # +: 4
    # -: 6
    # = : 5+4
    temp1 = sqrt(E)
    temp2 = c*(b - d) + a
    temp3 = 1.0/(1.0+c*c)
    temp4 = d - (c*(c*(d - b) - a))*temp3
    temp5 = (c*temp1)*temp3
    x1 = temp3*(temp2-temp1)
    y1 = temp4-temp5

    x2 = temp3*(temp2+temp1)
    y2 = temp4+temp5

    return [ [x1;y1] [x2;y2] ]
end

Then I tested them with the following function, expecting the version with far fewer operations to run faster, or at least no slower:

function Test2Functions( NumberOfTests::Real )
    local num = Int(NumberOfTests)
    # -- Generate random numbers
    local rands = Array{Float64,2}(undef, 5,num)
    for i in 1:num
        rands[:,i:i] = [rand(); rand(); rand(); rand(); rand()]
    end

    local res1 = Array{Array{Float64,2}}(undef, num)
    local res2 = Array{Array{Float64,2}}(undef, num)
    # - Test OriginalFunction
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res1[i] = OriginalFunction( a,b,c,d,E )
    end
    # - Test SimplifiedFunction
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res2[i] = SimplifiedFunction( a,b,c,d,E )
    end
    return res1, res2
end
Test2Functions( 1e6 )

However, it turned out that the two functions use the same amount of memory allocation, but the simplified one spends more time in garbage collection and runs about 5% slower:

julia> Test2Functions( 1e6 )
  1.778731 seconds (7.00 M allocations: 503.540 MiB, 47.35% gc time)
  1.787668 seconds (7.00 M allocations: 503.540 MiB, 50.92% gc time)

julia> Test2Functions( 1e6 )
  1.969535 seconds (7.00 M allocations: 503.540 MiB, 52.05% gc time)
  2.221151 seconds (7.00 M allocations: 503.540 MiB, 56.68% gc time)

julia> Test2Functions( 1e6 )
  1.946441 seconds (7.00 M allocations: 503.540 MiB, 55.23% gc time)
  2.099875 seconds (7.00 M allocations: 503.540 MiB, 59.33% gc time)

julia> Test2Functions( 1e6 )
  1.836350 seconds (7.00 M allocations: 503.540 MiB, 53.37% gc time)
  2.011242 seconds (7.00 M allocations: 503.540 MiB, 58.43% gc time)

julia> Test2Functions( 1e6 )
  1.856081 seconds (7.00 M allocations: 503.540 MiB, 53.44% gc time)
  2.002087 seconds (7.00 M allocations: 503.540 MiB, 58.21% gc time)

julia> Test2Functions( 1e6 )
  1.833049 seconds (7.00 M allocations: 503.540 MiB, 53.55% gc time)
  1.996548 seconds (7.00 M allocations: 503.540 MiB, 58.41% gc time)

julia> Test2Functions( 1e6 )
  1.846894 seconds (7.00 M allocations: 503.540 MiB, 53.53% gc time)
  2.053529 seconds (7.00 M allocations: 503.540 MiB, 58.30% gc time)

julia> Test2Functions( 1e6 )
  1.896265 seconds (7.00 M allocations: 503.540 MiB, 54.11% gc time)
  2.083253 seconds (7.00 M allocations: 503.540 MiB, 58.10% gc time)

julia> Test2Functions( 1e6 )
  1.910244 seconds (7.00 M allocations: 503.540 MiB, 53.79% gc time)
  2.085719 seconds (7.00 M allocations: 503.540 MiB, 58.36% gc time)

Could anyone kindly tell me why? A 5% speedup probably isn't worth fighting for, even in performance-critical code, but I'm still curious: how can I help the Julia compiler produce faster code?

2
I don't know Julia, but are you sure that many of your optimizations are not already done by an optimizing stage during compilation? - Walter Tross
I didn't know if or to what extent these optimizations help, or how exactly the compiler works. That was why I did the test. I was assuming time(compiler_optimization(my_crappy_optimization(f))) <= time(compiler_optimization(f)) - zyc
You already got your main answer, but I just wanted to add that you should get rid of all local declarations, since that does nothing here but add visual noise. And this assignment is a bit odd: rands[:,i:i] = [rand(); rand(); rand(); rand(); rand()] with the index i repeated as i:i. Instead just write: rands[:, i] .= rand.(). Oh, and just to emphasize it: always use BenchmarkTools for benchmarking :) - DNF
I know rands[:,i:i] looks odd. I developed this style because of an odd bug I fixed: arr1[i, :] = arr2[i, :] led to an error, since Julia treats arr2[i, :] as a 1D Array instead of a 2D row vector. I fixed that with arr1[i, :] = arr2[i:i, :]. Since this confuses my fuzzy brain to no end, I just stay conservative and err on the side of being repetitive. - zyc
Thanks for your interesting use of broadcasting. I should try to use it more often. From my testing, comparing for i in 1:length(randbuffer); randbuffer[i] = rand(); end with randbuffer .= rand.(): the for loop is much faster for a small array, whereas broadcasting is faster for a large array. Anyway, speed is irrelevant for small sizes. - zyc

2 Answers

7
votes

The reason is that you are running into garbage collection in the second loop (and not in the first). If you call GC.gc() before each timed loop you get more comparable results:

function Test2Functions( NumberOfTests::Real )
    local num = Int(NumberOfTests)
    # -- Generate random numbers
    local rands = Array{Float64,2}(undef, 5,num)
    for i in 1:num
        rands[:,i:i] = [rand(); rand(); rand(); rand(); rand()]
    end

    local res1 = Array{Array{Float64,2}}(undef, num)
    local res2 = Array{Array{Float64,2}}(undef, num)
    # - Test OriginalFunction
    GC.gc()
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res1[i] = OriginalFunction( a,b,c,d,E )
    end
    # - Test SimplifiedFunction
    GC.gc()
    @time for i in 1:num
        a,b,c,d,E = rands[:,i]
        res2[i] = SimplifiedFunction( a,b,c,d,E )
    end
    return res1, res2
end

# call this twice, as the first call includes compilation time
Test2Functions( 1e6 )
Test2Functions( 1e6 )

However, in general it is better to use the BenchmarkTools.jl package for benchmarking:

julia> function OriginalFunction()
           a,b,c,d,E = rand(5)
           x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
           y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)
           x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
           y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)
           return [ [x1;y1] [x2;y2] ]
       end
OriginalFunction (generic function with 2 methods)

julia>

julia> function SimplifiedFunction()
           a,b,c,d,E = rand(5)
           temp1 = sqrt(E)
           temp2 = c*(b - d) + a
           temp3 = 1.0/(1.0+c*c)
           temp4 = d - (c*(c*(d - b) - a))*temp3
           temp5 = (c*temp1)*temp3
           x1 = temp3*(temp2-temp1)
           y1 = temp4-temp5
           x2 = temp3*(temp2+temp1)
           y2 = temp4+temp5
           return [ [x1;y1] [x2;y2] ]
       end
SimplifiedFunction (generic function with 2 methods)

julia>

julia> using BenchmarkTools

julia> @btime OriginalFunction()
  136.211 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 -0.609035  0.954271
  0.724708  0.926523

julia> @btime SimplifiedFunction()
  137.201 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 0.284514  1.58639
 0.922347  0.979835

julia> @btime OriginalFunction()
  137.301 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 -0.109814  0.895533
  0.365399  1.08743

julia> @btime SimplifiedFunction()
  136.429 ns (7 allocations: 528 bytes)
2×2 Array{Float64,2}:
 0.516157  1.07871
 0.219441  0.361133

And we see that they have comparable performance. In general, you can expect the Julia and LLVM compilers to do most optimizations of this kind for you (it is not always guaranteed, but in this case it clearly happens).
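
If you want to check on a small scale what the compiler does, you can inspect the generated code yourself. Here is a minimal sketch (f and g are hypothetical functions, and the exact IR varies by Julia and LLVM version):

using InteractiveUtils   # exports @code_llvm (already loaded in the REPL)

f(c) = 1/(1 + c^2)       # written with ^ and Int literals
g(c) = 1.0/(1.0 + c*c)   # the hand-optimized equivalent

@code_llvm f(1.0)   # the power is lowered to a multiply and the 1s become 1.0s
@code_llvm g(1.0)   # essentially the same IR as for f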

EDIT

I have simplified the functions as follows:

function OriginalFunction( a,b,c,d,E )
    x1 = (1/(1+c^2))*(-c*d+a+c*b-sqrt(E))
    y1 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)-(c*sqrt(E))/(1+c^2)
    x2 = (1/(1+c^2))*(-c*d+a+c*b+sqrt(E))
    y2 = d-(c^2*d)/(1+c^2)+(c*a)/(1+c^2)+(c^2*b)/(1+c^2)+(c*sqrt(E))/(1+c^2)
    x1, y1, x2, y2
end

function SimplifiedFunction( a,b,c,d,E )
    temp1 = sqrt(E)
    temp2 = c*(b - d) + a
    temp3 = 1.0/(1.0+c*c)
    temp4 = d - (c*(c*(d - b) - a))*temp3
    temp5 = (c*temp1)*temp3
    x1 = temp3*(temp2-temp1)
    y1 = temp4-temp5
    x2 = temp3*(temp2+temp1)
    y2 = temp4+temp5
    x1, y1, x2, y2
end

This concentrates only on the core of the calculations. I then ran @code_native on both functions; here is the output (stripped of comments to shorten it):

        .text
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $112, %rsp
        vmovaps %xmm10, -16(%rbp)
        vmovaps %xmm9, -32(%rbp)
        vmovaps %xmm8, -48(%rbp)
        vmovaps %xmm7, -64(%rbp)
        vmovaps %xmm6, -80(%rbp)
        vmovsd  56(%rbp), %xmm8         # xmm8 = mem[0],zero
        vxorps  %xmm4, %xmm4, %xmm4
        vucomisd        %xmm8, %xmm4
        ja      L229
        vmovsd  48(%rbp), %xmm9         # xmm9 = mem[0],zero
        vmulsd  %xmm9, %xmm3, %xmm5
        vsubsd  %xmm5, %xmm1, %xmm5
        vmulsd  %xmm3, %xmm2, %xmm6
        vaddsd  %xmm5, %xmm6, %xmm10
        vmulsd  %xmm3, %xmm3, %xmm6
        movabsq $526594656, %rax        # imm = 0x1F633260
        vmovsd  (%rax), %xmm7           # xmm7 = mem[0],zero
        vaddsd  %xmm7, %xmm6, %xmm0
        vdivsd  %xmm0, %xmm7, %xmm7
        vsqrtsd %xmm8, %xmm8, %xmm4
        vsubsd  %xmm4, %xmm10, %xmm5
        vmulsd  %xmm5, %xmm7, %xmm8
        vmulsd  %xmm9, %xmm6, %xmm5
        vdivsd  %xmm0, %xmm5, %xmm5
        vsubsd  %xmm5, %xmm9, %xmm5
        vmulsd  %xmm3, %xmm1, %xmm1
        vdivsd  %xmm0, %xmm1, %xmm1
        vaddsd  %xmm5, %xmm1, %xmm1
        vmulsd  %xmm2, %xmm6, %xmm2
        vdivsd  %xmm0, %xmm2, %xmm2
        vaddsd  %xmm1, %xmm2, %xmm1
        vmulsd  %xmm3, %xmm4, %xmm2
        vdivsd  %xmm0, %xmm2, %xmm0
        vsubsd  %xmm0, %xmm1, %xmm2
        vaddsd  %xmm10, %xmm4, %xmm3
        vmulsd  %xmm3, %xmm7, %xmm3
        vaddsd  %xmm1, %xmm0, %xmm0
        vmovsd  %xmm8, (%rcx)
        vmovsd  %xmm2, 8(%rcx)
        vmovsd  %xmm3, 16(%rcx)
        vmovsd  %xmm0, 24(%rcx)
        movq    %rcx, %rax
        vmovaps -80(%rbp), %xmm6
        vmovaps -64(%rbp), %xmm7
        vmovaps -48(%rbp), %xmm8
        vmovaps -32(%rbp), %xmm9
        vmovaps -16(%rbp), %xmm10
        addq    $112, %rsp
        popq    %rbp
        retq
L229:
        movabsq $throw_complex_domainerror, %rax
        movl    $72381680, %ecx         # imm = 0x45074F0
        vmovapd %xmm8, %xmm1
        callq   *%rax
        ud2
        ud2
        nop

and

        .text
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        vmovaps %xmm7, -16(%rbp)
        vmovaps %xmm6, -32(%rbp)
        vmovsd  56(%rbp), %xmm0         # xmm0 = mem[0],zero
        vxorps  %xmm4, %xmm4, %xmm4
        vucomisd        %xmm0, %xmm4
        ja      L178
        vmovsd  48(%rbp), %xmm4         # xmm4 = mem[0],zero
        vsqrtsd %xmm0, %xmm0, %xmm0
        vsubsd  %xmm4, %xmm2, %xmm5
        vmulsd  %xmm3, %xmm5, %xmm5
        vaddsd  %xmm1, %xmm5, %xmm5
        vmulsd  %xmm3, %xmm3, %xmm6
        movabsq $526593928, %rax        # imm = 0x1F632F88
        vmovsd  (%rax), %xmm7           # xmm7 = mem[0],zero
        vaddsd  %xmm7, %xmm6, %xmm6
        vdivsd  %xmm6, %xmm7, %xmm6
        vsubsd  %xmm2, %xmm4, %xmm2
        vmulsd  %xmm3, %xmm2, %xmm2
        vsubsd  %xmm1, %xmm2, %xmm1
        vmulsd  %xmm3, %xmm1, %xmm1
        vmulsd  %xmm1, %xmm6, %xmm1
        vsubsd  %xmm1, %xmm4, %xmm1
        vmulsd  %xmm3, %xmm0, %xmm2
        vmulsd  %xmm2, %xmm6, %xmm2
        vsubsd  %xmm0, %xmm5, %xmm3
        vmulsd  %xmm3, %xmm6, %xmm3
        vsubsd  %xmm2, %xmm1, %xmm4
        vaddsd  %xmm5, %xmm0, %xmm0
        vmulsd  %xmm0, %xmm6, %xmm0
        vaddsd  %xmm1, %xmm2, %xmm1
        vmovsd  %xmm3, (%rcx)
        vmovsd  %xmm4, 8(%rcx)
        vmovsd  %xmm0, 16(%rcx)
        vmovsd  %xmm1, 24(%rcx)
        movq    %rcx, %rax
        vmovaps -32(%rbp), %xmm6
        vmovaps -16(%rbp), %xmm7
        addq    $64, %rsp
        popq    %rbp
        retq
L178:
        movabsq $throw_complex_domainerror, %rax
        movl    $72381680, %ecx         # imm = 0x45074F0
        vmovapd %xmm0, %xmm1
        callq   *%rax
        ud2
        ud2
        nopl    (%rax,%rax)

You probably will not want to digest it in detail, but you can see that the simplified function uses a few fewer instructions, but only a few, which can be surprising if you compare the original source code. For example, both compiled functions compute sqrt only once (so the multiple sqrt calls in the first function were optimized out).
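
You can reproduce this common subexpression elimination on a toy example (two_sqrts is a hypothetical helper, not from the question, and the exact instructions depend on your CPU and Julia version):

using InteractiveUtils

# sqrt appears twice in the source, but the compiler should compute it once
two_sqrts(E) = (sqrt(E) + 1.0, sqrt(E) - 1.0)

@code_native two_sqrts(2.0)   # expect a single vsqrtsd (or equivalent) instruction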

0
votes

Lots of the reason is that Julia automatically performs some of your optimizations (specifically I know fixed integer powers are compiled to efficient sequences of multiplications). Constant propagation probably also can let the compiler turn the 1 into a 1.0. In general, Julia's compiler is very aggressive at making your code fast as long as it can do type inference.