16
votes

I have some code that will calculate the distances between every Cartesian coordinate in one matrix to every other coordinate in another. For each coordinate, the minimum distance will be returned along with the index positions for the coordinates that produced the minimum.

function MED3D(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,3))
    @sync @distributed for k in 1:n1
        Dist[k,:] = MD3D(m1[k,:], m2, k)
    end
    return Dist
end

@everywhere function MD3D(v1, m2, k)
    dsum::Float64 = Inf
    dtemp::Float64 = Inf
    i = 0
    for j in 1:size(m2,1)
        @inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
        if dtemp < dsum
            dsum = dtemp
            i = j
        end
    end
    return [dsum, k, i]
end

m1 = rand(10,3)
m2 = rand(15,3)
results = MED3D(m1,m2)

While this works reasonably alright with smaller 3D point clouds, I am looking to increase the performance for large point clouds by using GPU-based analyses. However, using the more typical ways to do matrix operations in Julia seems not possible since I have to return the index positions and the minimum distance. I've tried several different ways to adopt CUarrays for this task, but so far they have all failed without using actual for loops. Also, a lot of ways to implement it appear exceptionally inefficient due to storing the distance matrix in memory, which quickly exceeds 128gb of ram for my particular data set.

Could someone please help me with how to properly implement this in Julia to operate on a GPU? Is CUarrays even the correct approach, or is it too abstract of a level given that I am returning indices in addition to a distance? I've tried to calculate the L2 norm using product and dot, but it's not exactly giving me what I require.

UPDATE:

Here is my failed attempt to GPUify the inner loop using broadcasting.

using CuArrays
function difff(m1,m2)
    n1 = size(m1,1)
    Dist = Array{Float64}(undef, n1,3)
    m2 = CuArray(m2)
    m1 = CuArray(m1)
    for z in 1:size(m1)
        v1 = transpose(m1[z,:])
        i = 0
        dsum::Float64 = Inf
        mi = v1 .- m2
        mi = mi .* mi
        mi = sum(mi, dims=2)
        mi = mi .^ 0.5
        mi = findmin(mi)
        i = mi[2][1]
        dsum = mi[1]
        @inbounds Dist[z,:] = [dsum,z,i]
    end
end

UPDATE:

Failed attempt #2. I tried to calculating the minimum distances and forgetting about the indices. This isn't ideal for my application, but I can live with it. However, this only works correctly if the first array has a single row. I tried to solve this by using mapslices, but it does not work.

using CuArray
a = rand(1,3)
b = rand(3,3)

a = CuArray(a)
b = CuArray(b)

function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

mapslices(GK(b), a, 2)

UPDATE:

Making progress by using an outer loop, but surely there is a better way to do this?

using CuArray
using BenchmarkTools
aa = rand(2,3)
bb = rand(5000000,3)

a = CuArray(aa)
b = CuArray(bb)

function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

function D(a,b)
    Dist = Array{Float64}(undef,size(a,1),1)
    for i in 1:size(a,1)
        Dist[i] = GK(a[i,:]',b)
    end
    return Dist
end

@benchmark test = D(a,b)
@benchmark test = D(aa,bb)

UPDATE:

Some benchmarking between my previous distributed version, modified distributed version, GPU version, and serial version.EDIT: After scaling to a 100 billion comparisons, the GPU version no longer outperforms my previous distributed version... Any thoughts on why this is????

using Distributed
using SharedArrays
using CuArrays
using BenchmarkTools

aa = rand(4,3)
bb = rand(500000,3)
a = CuArray(aa)
b = CuArray(bb)

function MED3D(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1
        Dist[k] = MD3D(m1[k,:]', m2)
    end
    return Dist
end

@everywhere function MD3D(v1, m2)
    dsum::Float64 = Inf
    dtemp::Float64 = Inf
    for j in 1:size(m2,1)
        @inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
        if dtemp < dsum
            dsum = dtemp
        end
    end
    return dsum
end

function MED3DGK(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1

        @inbounds Dist[k] = GK(m1[k,:]',m2)
    end
    return Dist
end

@everywhere function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

function D(a,b)
    Dist = Array{Float64}(undef,size(a,1),1)
    for i in 1:size(a,1)
        @inbounds Dist[i] = GK(a[i,:]',b)
    end
    return Dist
end

@benchmark test = D(a,b)
@benchmark test = D(aa,bb)
@benchmark test = MED3D(aa,bb)
@benchmark test = MED3DGK(aa,bb)

UPDATE:

Implementation using NearestNeighbors.jl with distributed processing. Any thoughts on how to make this even faster?:

function MED3D(m1, m2)
    m2 = Matrix(m2')
    kdtree = KDTree(m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    Ind = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1
        Ind[k,:], Dist[k,:] = knn(kdtree, m1[k,:], 1)
    end
    return [Ind,Dist]
end
1
@kevbonham I've looked into Distances.jl. The reason I haven't used it was the lack of parallel support. Given how much data I have to analyze, using a serial approach just isn't realistic. Also, from what I can see, they have a tendency to return all of the data rather than tossing out the unneeded distances, which causes a significant increase in memory allocation for my particular case. Returning an entire distance matrix is just crazy for me.JJL
Have you considered building NNTrees and use knn from NearestNeighbors.jl? I think it's pretty hard to beat for perf.Michael K. Borregaard
The NearestNeighbors package produces indices and exact distances - are you not conflating this with a different algorithm?Michael K. Borregaard
Well, I was definitely wrong! the NearestNeighbor package is producing identical results even when scaling to billions of comparisons and about a 10,000 fold increase in speed. I've been going about this all wrong.JJL
That's great that you managed to get such a huge speed-up! I'm wondering if there's a way to reformulate this question in a way that is useful and answerable. Having all of your updates is great documentation, but SO really depends on Questions and answers. This is a bit of a journey instead (maybe would be good to reproduce over at discourse.julialang.org?kevbonham

1 Answers

1
votes

I'm not sure if it will help in your case, but when taking a slice m1[k,:] by default julia copies that memory (although maybe it depends on what the knn function does with that slice.

Does it improve anything if you change it to knn(kdtree, @view m1[k,:], 1)