1
votes

I am beginning to learn Julia after using Matlab for several years, and I started by implementing simple polynomial multiplication (without FFT) to try to understand the role of type stability. A big part of this project is the requirement for a fast polynomial multiplier. However, I am getting the following timings, which I can't understand at all.
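For reference, the two loops below compute the standard convolution of the coefficient vectors: with inputs of length N, the output has length 2N-1 and entry m is

prodterm[m] = sum(L[k]*R[m-k+1] for k = max(1, m-N+1):min(m, N))

i.e. the Cauchy product of the two polynomials.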

function cauchyproduct(L::Array{Float64},R::Array{Float64})
    # good one for floats
    N = length(L)
    prodterm = zeros(1,2N-1)
    # first half of the convolution: output coefficients 1..N
    for n = 1:N
        Lterm = view(L,1:n)
        Rterm = view(R,n:-1:1)
        prodterm[n] = dot(Lterm,Rterm)
    end

    # second half: output coefficients N+1..2N-1
    for n = 1:N-1
        Lterm = view(L,n+1:N)
        Rterm = view(R,N:-1:n+1)
        prodterm[N+n] = dot(Lterm,Rterm)
    end
    prodterm
end



testLength = 10000
goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j in 1:10
    @time cauchyproduct(goodL,goodR)
end
@which cauchyproduct(goodL,goodR)

I get the following timings from two sequential runs of this code, and they are completely erratic from one run to another. In general, the time per test can range from .05s to 2s. Typically the timings within a single run through the for loop are similar to one another (as in the example below), but even that isn't always the case; occasionally they alternate, e.g. .05s, .05s, 1.9s, .04s, .05s, 2.1s, and so on.

Any idea why this is happening?

  0.544795 seconds (131.08 k allocations: 5.812 MiB)
  0.510395 seconds (120.00 k allocations: 5.340 MiB)
  0.528362 seconds (120.00 k allocations: 5.340 MiB, 0.94% gc time)
  0.507156 seconds (120.00 k allocations: 5.340 MiB)
  0.507566 seconds (120.00 k allocations: 5.340 MiB)
  0.507932 seconds (120.00 k allocations: 5.340 MiB)
  0.527383 seconds (120.00 k allocations: 5.340 MiB)
  0.513301 seconds (120.00 k allocations: 5.340 MiB, 0.83% gc time)
  0.509347 seconds (120.00 k allocations: 5.340 MiB)
  0.509177 seconds (120.00 k allocations: 5.340 MiB)
  0.052247 seconds (120.00 k allocations: 5.340 MiB, 7.95% gc time)
  0.049644 seconds (120.00 k allocations: 5.340 MiB)
  0.047275 seconds (120.00 k allocations: 5.340 MiB)
  0.049163 seconds (120.00 k allocations: 5.340 MiB)
  0.049029 seconds (120.00 k allocations: 5.340 MiB)
  0.054050 seconds (120.00 k allocations: 5.340 MiB, 8.36% gc time)
  0.047010 seconds (120.00 k allocations: 5.340 MiB)
  0.051240 seconds (120.00 k allocations: 5.340 MiB)
  0.050961 seconds (120.00 k allocations: 5.340 MiB)
  0.049841 seconds (120.00 k allocations: 5.340 MiB, 4.90% gc time)

Edit: The timings above were obtained by executing the code below the function definition twice in a row. Specifically, the code block

goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j in 1:10
    @time cauchyproduct(goodL,goodR)
end

gives vastly different timings on different runs (without the function above being recompiled). In all of the timings, the same method of cauchyproduct is being called (as confirmed by @which). Hopefully this clarifies the problem.

Edit 2: I changed the code block at the end to the following

testLength = 10000
goodL = rand(1,testLength)
goodR = rand(1,testLength)
for j = 1:3
    @time cauchyproduct(goodL,goodR)
end


for j = 1:3
    goodL = rand(1,testLength)
    goodR = rand(1,testLength)
    @time cauchyproduct(goodL,goodR)
end

@time cauchyproduct(goodL,goodR)
@time cauchyproduct(goodL,goodR)
@time cauchyproduct(goodL,goodR)

and got the following timings from two consecutive executions of the new block.

Timing 1:

  0.045936 seconds (120.00 k allocations: 5.340 MiB)
  0.045740 seconds (120.00 k allocations: 5.340 MiB)
  0.045768 seconds (120.00 k allocations: 5.340 MiB)
  1.549157 seconds (120.00 k allocations: 5.340 MiB, 0.14% gc time)
  0.046797 seconds (120.00 k allocations: 5.340 MiB)
  0.046637 seconds (120.00 k allocations: 5.340 MiB)
  0.047143 seconds (120.00 k allocations: 5.341 MiB)
  0.049088 seconds (120.00 k allocations: 5.341 MiB, 3.88% gc time)
  0.049246 seconds (120.00 k allocations: 5.341 MiB)

Timing 2:

  2.250852 seconds (120.00 k allocations: 5.340 MiB)
  2.370882 seconds (120.00 k allocations: 5.340 MiB)
  2.247676 seconds (120.00 k allocations: 5.340 MiB, 0.14% gc time)
  1.550661 seconds (120.00 k allocations: 5.340 MiB)
  0.047258 seconds (120.00 k allocations: 5.340 MiB)
  0.047169 seconds (120.00 k allocations: 5.340 MiB)
  0.048625 seconds (120.00 k allocations: 5.341 MiB, 4.02% gc time)
  0.045489 seconds (120.00 k allocations: 5.341 MiB)
  0.049457 seconds (120.00 k allocations: 5.341 MiB)

So confused.

I have edited the original post to try and clarify the situation. - SDK
By a "run" I mean when I run the lower block with the for loop. Each of the 10 timings inside the for loop "usually" has similar timings. But rerunning that block of code (i.e. doing 10 additional timings) gives times which are all over the place. The block shown was running that for loop twice (total of 20 timings). - SDK
It is per run. Each group of 10 timings has similar values for each timing. However, running the block again gives 10 new timings with vastly different values (as you noticed with the two blocks of 10 above). - SDK
I'll admit the way the question is asked is a bit confusing, but I've no idea why you've been downvoted to -2. Overly harsh. I can duplicate these results on my machine, and the question is reasonable. It can be frustrating when downvoters don't explain their actions in the comments. I'm upvoting to try and get this back to 0 at least. - Colin T Bowers
Incidentally, your "good" and "bad" functions will compile down to the same code if the input can be guaranteed to always be Array{Float64,2}, which it can in your current tests, so you will see no performance difference between the two. See here for more detail. - Colin T Bowers

1 Answer

3
votes

Short answer: Your code is a bit odd, and so is probably triggering garbage collection in unexpected ways, resulting in the varied timings.

Long answer: I agree that the timings you are getting are a bit strange. I'm not completely sure I can nail down exactly what is causing the problem, but I'm 99% certain it is something to do with garbage collection.
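As an aside, when timings are this noisy it is usually worth benchmarking with the BenchmarkTools package rather than bare @time calls, since it runs many samples and is far less sensitive to compilation and GC noise. A minimal sketch, assuming the package is installed:

using BenchmarkTools
goodL = rand(1, 10000)
goodR = rand(1, 10000)
@btime cauchyproduct($goodL, $goodR)  # $ interpolates the globals so their lookup isn't timed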

So, your code is a bit odd: you allow input arrays of any dimension (Array{Float64} leaves the number of dimensions unspecified), even though you then call the dot function (which takes the dot product of two vectors and eventually feeds through to a BLAS routine). In case you didn't realise, if you want a vector, use Array{Float64,1}; for a matrix, Array{Float64,2}; and so on. You could also use the aliases Vector{Float64} and Matrix{Float64}.
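For example, a quick check of the types:

typeof(rand(3))     # Array{Float64,1}, i.e. Vector{Float64}
typeof(rand(1,3))   # Array{Float64,2}, i.e. Matrix{Float64}
Vector{Float64} == Array{Float64,1}   # true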

The second odd thing I noticed is that in your test you generate rand(1, N). This returns an Array{Float64,2}, i.e. a matrix; to get an Array{Float64,1}, i.e. a vector, you would use rand(N). Within your function you then take views into your matrix, which are of size 1xN. Now, Julia uses column-major ordering, so using a 1xN object as a vector is going to be really inefficient, and is probably the source of your strange timings. Under the hood, I suspect the call to dot involves converting these things into regular vectors of floats, since dot eventually feeds through to the underlying BLAS routine, which needs that input type. All those conversions mean plenty of temporary storage, which has to be garbage collected at some point, and that is probably where the varying timings come from (90% of the time, varied timings on the same code are the result of the garbage collector being triggered, sometimes in quite unexpected ways).
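To see the column-major ordering concretely, here is a tiny example:

A = [1 2; 3 4]
vec(A)   # returns [1, 3, 2, 4]: the matrix is stored column-by-column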

There are probably several ways to improve on this further, but my quick-and-dirty version of your function looks like this:

function cauchyproduct(L::AbstractVector{<:Number}, R::AbstractVector{<:Number})
    length(L) != length(R) && error("Length mismatch in inputs")
    N = length(L)
    prodterm = zeros(2*N - 1)  # output is a vector, not a 1x(2N-1) matrix
    R = flipdim(R, 1)  # reverse R once up front so the loops use forward-ordered views
    # first half of the convolution: output coefficients 1..N
    for n = 1:N
        prodterm[n] = dot(view(L, 1:n), view(R, N-n+1:N))
    end
    # second half: output coefficients N+1..2N-1
    for n = 1:N-1
        prodterm[N+n] = dot(view(L, n+1:N), view(R, 1:N-n))
    end
    return prodterm
end

Note that I flip R once before the loops so the memory doesn't need to be re-ordered over and over on each iteration; the repeated re-ordering was no doubt contributing to your strange garbage collection issues. Then, applying your test (I think it is a better idea to move the array generation inside the loop, in case some clever caching issue throws off the timings):

testLength = 10000
for j = 1:20
    goodL = rand(testLength);
    goodR = rand(testLength);
    @time cauchyproduct(goodL,goodR);
end

we get something like this:

  0.105550 seconds (78.19 k allocations: 3.935 MiB, 2.91% gc time)
  0.022421 seconds (40.00 k allocations: 2.060 MiB)
  0.022527 seconds (40.00 k allocations: 2.060 MiB)
  0.022333 seconds (40.00 k allocations: 2.060 MiB)
  0.021568 seconds (40.00 k allocations: 2.060 MiB)
  0.021837 seconds (40.00 k allocations: 2.060 MiB)
  0.022155 seconds (40.00 k allocations: 2.060 MiB)
  0.022071 seconds (40.00 k allocations: 2.060 MiB)
  0.021720 seconds (40.00 k allocations: 2.060 MiB)
  0.024774 seconds (40.00 k allocations: 2.060 MiB, 9.13% gc time)
  0.021714 seconds (40.00 k allocations: 2.060 MiB)
  0.022066 seconds (40.00 k allocations: 2.060 MiB)
  0.021815 seconds (40.00 k allocations: 2.060 MiB)
  0.021819 seconds (40.00 k allocations: 2.060 MiB)
  0.021928 seconds (40.00 k allocations: 2.060 MiB)
  0.021795 seconds (40.00 k allocations: 2.060 MiB)
  0.021837 seconds (40.00 k allocations: 2.060 MiB)
  0.022285 seconds (40.00 k allocations: 2.060 MiB)
  0.021380 seconds (40.00 k allocations: 2.060 MiB)
  0.023828 seconds (40.00 k allocations: 2.060 MiB, 6.91% gc time)

The first iteration mostly measures compile time rather than runtime, and so should be ignored (if you don't know what I mean by this, check the performance-tips section of the official docs). As you can see, the remaining iterations are much faster and quite stable.