
As a starting point in Julia I decided to implement a simple Strassen product:

@inbounds function strassen_product(A :: Array{Num, 2}, B :: Array{Num, 2}, k = 2) :: Array{Num, 2} where {Num <: Number}

    A_height, A_width = size(A)
    B_height, B_width = size(B)

    @assert A_height == A_width == B_height == B_width  "Matrices are not both square or of equal size."
    @assert isinteger(log2(A_height))                   "Size of matrices is not a power of 2."

    if A_height ≤ k
        return A * B
    end

    middle = A_height ÷ 2

    A₁₁, A₁₂ =  A[1:middle, 1:middle],     A[1:middle, middle+1:end]
    A₂₁, A₂₂ =  A[middle+1:end, 1:middle], A[middle+1:end, middle+1:end]

    B₁₁, B₁₂ = B[1:middle, 1:middle],     B[1:middle, middle+1:end]
    B₂₁, B₂₂ = B[middle+1:end, 1:middle], B[middle+1:end, middle+1:end]

    P₁ = strassen_product(A₁₁ + A₂₂, B₁₁ + B₂₂)
    P₂ = strassen_product(A₂₁ + A₂₂, B₁₁    )
    P₃ = strassen_product(A₁₁,       B₁₂ - B₂₂)
    P₄ = strassen_product(A₂₂,       B₂₁ - B₁₁)
    P₅ = strassen_product(A₁₁ + A₁₂, B₂₂      )
    P₆ = strassen_product(A₂₁ - A₁₁, B₁₁ + B₁₂)
    P₇ = strassen_product(A₁₂ - A₂₂, B₂₁ + B₂₂)

    C₁₁ = P₁ + P₄ - P₅ + P₇
    C₁₂ = P₃ + P₅
    C₂₁ = P₂ + P₄
    C₂₂ = P₁ + P₃ - P₂ + P₆

    return [ C₁₁ C₁₂ ;
             C₂₁ C₂₂ ]

end

All well and good. I actually like the whole idea of unsafe-ish optimizations like @inbounds that affect performance by a large margin, not just a few milliseconds.

Now, since I have no for loops, the next optimization would be to use views for those A₁₁, etc. submatrices so that no copying takes place.

So I slapped @views in front of the 4 lines containing indexing. Of course I got an error, because a couple of lines later the recursive calls want Array{...} arguments, not SubArray{...}. So I changed the argument types and the return type to AbstractArray{Num, 2}. This time it worked, since AbstractArray is the base type for array types, but... performance took a plunge off a cliff: literally 10 times slower and a ton more allocations.
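The type mismatch described above can be reproduced in a few lines (a minimal sketch; the throwaway functions f and g are not part of the original code):

```julia
# A method that demands a concrete Array rejects a view, while one
# taking AbstractArray accepts it, since SubArray <: AbstractArray.
f(A::Array{<:Number,2})         = sum(A)
g(A::AbstractArray{<:Number,2}) = sum(A)

A = rand(4, 4)
V = @view A[1:2, 1:2]   # a SubArray, not an Array

g(V)     # works
# f(V)   # MethodError: no method matching f(::SubArray{...})
```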

My test case is this:

A = rand(1:10, 4, 4)
B = rand(1:10, 4, 4)

@time C = strassen_product(A, B)

When using @views + AbstractArray:

0.457157 seconds (1.96 M allocations: 98.910 MiB, 5.56% gc time)

When using the no-view version:

0.049756 seconds (126.92 k allocations: 5.603 MiB)

The difference is huge: the version that is supposed to be faster is 9-10x slower, has about 15x more allocations and almost 20x the memory footprint of the other version.

EDIT: This is not the first run for either case but the most "median" value for ~10 test runs. Not the first run and certainly not a min or max peak.

EDIT: I am using version 1.0.

My question is: why does this happen? What am I not getting? Is my reasoning that using views instead of copies would be faster... wrong?

It looks like your timings are from the first run: in this case the dominant cost is the JIT compilation. Try timing after running the function once. – Simon Byrne

Are you running @time just once? The first time it JIT compiles; notice how subsequent calls are faster. I recommend using @btime from the package BenchmarkTools for these micro benchmarks instead. – HarmonicaMuse

Timing subsequent runs, I still get roughly the same timings. Note that you are still allocating results for the output: try doing the whole thing in-place. Also, type specifiers in the function arguments won't affect performance. See docs.julialang.org/en/v1/manual/performance-tips for more details. – Simon Byrne

@SimonByrne they are not from the first run, they are the more 'median' value of 5-10 runs. – Rares Dima

What Julia version are you using? I get almost identical timings (slight advantage for the @views version) for both versions on Julia 1.0. – fredrikekre

1 Answer


Yes, the view version takes longer than the copy version, but it allocates less memory. Here is why.

Using views instead of copies does not necessarily mean a speed-up (although it does decrease memory allocations). The speed-up depends on many things in your program. First of all, each view you create is over a tile of a matrix. Matrices in Julia are stored column-major in memory, which makes retrieving two columns an easier job for the CPU than retrieving two rows, since the elements of a column are stored contiguously.
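A quick way to see this is with strides, which reports the memory step (in elements) per index (a small sketch, not from the original post):

```julia
# Column-major layout: a 3×4 Matrix stores each column contiguously,
# so consecutive elements of a column are 1 apart in memory, while
# consecutive elements of a row are 3 (the column height) apart.
A = collect(reshape(1:12, 3, 4))

col = @view A[:, 2]
row = @view A[2, :]

strides(col)    # (1,) -- contiguous
strides(row)    # (3,) -- strided: every 3rd element
```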

A tile of a matrix is not stored contiguously in memory. Creating a copy of a tile accesses each necessary element of the matrix and writes them into a contiguous block in memory, whereas creating a view of a tile stores only its bounds. Although creating a copy takes much more time and many more memory accesses than creating a view of a tile, the copy is stored contiguously in memory, which means easier access, easier vectorization, and easier caching for the CPU on later uses.
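For example, comparing a 2×2 tile taken as a view versus as a copy (a minimal sketch; the stride numbers assume a 4×4 parent matrix):

```julia
M = collect(reshape(1:16, 4, 4))

tile_view = @view M[1:2, 1:2]   # stores only the parent and the index ranges
tile_copy = M[1:2, 1:2]         # fresh contiguous 2×2 Matrix

strides(tile_view)   # (1, 4) -- column step inherits the parent's height
strides(tile_copy)   # (1, 2) -- laid out contiguously on its own
```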

You access the tiles you created more than once, and each new access comes after a full recursive call, each of which takes time and memory accesses, so the tiles you already loaded into cache may be evicted. Therefore each new access to the same tile may require a full reload from memory. But a full reload of a view tile takes longer than a full reload of a copy tile. That is why the view version takes longer than the copy version, even though the view version allocates less memory and performs fewer accesses.

Take a look at the performance tips in the documentation, both Consider using views for slices and Copying data is not always bad:

Arrays are stored contiguously in memory, lending themselves to CPU vectorization and fewer memory accesses due to caching. These are the same reasons that it is recommended to access arrays in column-major order (see above). Irregular access patterns and non-contiguous views can drastically slow down computations on arrays because of non-sequential memory access.

Copying irregularly-accessed data into a contiguous array before operating on it can result in a large speedup, such as in the example below. Here, a matrix and a vector are being accessed at 800,000 of their randomly-shuffled indices before being multiplied. Copying the views into plain arrays speeds up the multiplication even with the cost of the copying operation...
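A stripped-down sketch of that docs example (the array size and index count come from the quoted text; sum is used here just to force the access):

```julia
using Random

x = randn(1_000_000)
inds = shuffle(1:1_000_000)[1:800_000]   # randomly-shuffled indices

s_view = sum(view(x, inds))   # indirect, non-contiguous gather on every pass
xtmp   = x[inds]              # copy once into a plain contiguous Vector
s_copy = sum(xtmp)            # then operate with sequential access

s_view ≈ s_copy               # same result; the copy path is cache-friendly
```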

That said, however, the difference should not be as big as your results suggest. In addition, the product of two 4x4 matrices should not even take a millisecond. I believe that in each call you also reload your function definitions, which invalidates the JIT-compiled version, so Julia has to compile your function again and again. It is also possible that your new function is not type stable. I suggest that you use @btime from BenchmarkTools for measuring allocations and time.
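For example (a sketch; the plain matrix product stands in for whatever function is under test, and the $-interpolation is what makes the measurement honest):

```julia
using BenchmarkTools

A = rand(1:10, 4, 4)
B = rand(1:10, 4, 4)

# Interpolate globals with `$` so the measurement excludes dynamic
# dispatch on non-constant globals; the first call's JIT compilation
# is also excluded, because @btime runs many samples and reports the minimum.
@btime $A * $B;   # substitute strassen_product($A, $B) to time the question's code
```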

You should also test with larger matrices: creating a 2x2 view of a 4x4 array still writes metadata to memory comparable in size to a 2x2 copy. Timing with small data is also prone to noise.

@inbounds function strassen_product(A, B, k = 2)

    A_height, A_width = size(A)
#   B_height, B_width = size(B)

#   @assert A_height == A_width == B_height == B_width  "Matrices are not both square or of equal size."
#   @assert isinteger(log2(A_height))                   "Size of matrices is not a power of 2."

    if A_height ≤ k
        return A * B
    end

    middle = A_height ÷ 2

    A₁₁, A₁₂ = @view(A[1:middle, 1:middle]),     @view(A[1:middle, middle+1:end])
    A₂₁, A₂₂ = @view(A[middle+1:end, 1:middle]), @view(A[middle+1:end, middle+1:end])

    B₁₁, B₁₂ = @view(B[1:middle, 1:middle]),     @view(B[1:middle, middle+1:end])
    B₂₁, B₂₂ = @view(B[middle+1:end, 1:middle]), @view(B[middle+1:end, middle+1:end])

    P₁ = strassen_product(A₁₁ + A₂₂, B₁₁ + B₂₂)
    P₂ = strassen_product(A₂₁ + A₂₂, B₁₁    )
    P₃ = strassen_product(A₁₁,       B₁₂ - B₂₂)
    P₄ = strassen_product(A₂₂,       B₂₁ - B₁₁)
    P₅ = strassen_product(A₁₁ + A₁₂, B₂₂      )
    P₆ = strassen_product(A₂₁ - A₁₁, B₁₁ + B₁₂)
    P₇ = strassen_product(A₁₂ - A₂₂, B₂₁ + B₂₂)

    C₁₁ = P₁ + P₄ - P₅ + P₇
    C₁₂ = P₃ + P₅
    C₂₁ = P₂ + P₄
    C₂₂ = P₁ + P₃ - P₂ + P₆

    return [ C₁₁ C₁₂ ;
             C₂₁ C₂₂ ]

end
@inbounds function strassen_product2(A, B, k = 2)

    A_height, A_width = size(A)
    #B_height, B_width = size(B)

    #@assert A_height == A_width == B_height == B_width  "Matrices are not both square or of equal size."
    #@assert isinteger(log2(A_height))                   "Size of matrices is not a power of 2."

    if A_height ≤ k
        return A * B
    end

    middle = A_height ÷ 2

    A₁₁, A₁₂ =  A[1:middle, 1:middle], A[1:middle, middle+1:end]
    A₂₁, A₂₂ =  A[middle+1:end, 1:middle], A[middle+1:end, middle+1:end]

    B₁₁, B₁₂ = B[1:middle, 1:middle],     B[1:middle, middle+1:end]
    B₂₁, B₂₂ = B[middle+1:end, 1:middle], B[middle+1:end, middle+1:end]

    P₁ = strassen_product2(A₁₁ + A₂₂, B₁₁ + B₂₂)
    P₂ = strassen_product2(A₂₁ + A₂₂, B₁₁    )
    P₃ = strassen_product2(A₁₁,       B₁₂ - B₂₂)
    P₄ = strassen_product2(A₂₂,       B₂₁ - B₁₁)
    P₅ = strassen_product2(A₁₁ + A₁₂, B₂₂      )
    P₆ = strassen_product2(A₂₁ - A₁₁, B₁₁ + B₁₂)
    P₇ = strassen_product2(A₁₂ - A₂₂, B₂₁ + B₂₂)

    C₁₁ = P₁ + P₄ - P₅ + P₇
    C₁₂ = P₃ + P₅
    C₂₁ = P₂ + P₄
    C₂₂ = P₁ + P₃ - P₂ + P₆

    return [ C₁₁ C₁₂ ;
             C₂₁ C₂₂ ]

end

Here are the tests with @btime from BenchmarkTools:

A = rand(1:10, 256, 256)
B = rand(1:10, 256, 256)

@btime C = strassen_product(A, B);  # view version
@btime D = strassen_product2(A, B); # copy version

Results:

438.294 ms (4941454 allocations: 551.53 MiB)
349.894 ms (4529747 allocations: 620.04 MiB)