Prevent calculation twice in opencl

Question

I'm pretty new in ocl programming.

I have 2 million polygons (4 lines each - for this examples - but it is variant) that i need to find intersection with 1000 ellipses.

I need to know which if each ellipse intersected at least with one of the polygons.

For that i've created a poylgons buffer with all the points, and ellipses buffer.

My output buffer is a 1000 items int set to 0 in all items. and the kernel will set 1 (when he found intersection) in the right index according to the ellipse index.

I ran the kernel with a global - 2 dim, {2million, 1000}.

__kernel void polygonsIntersectsEllipses(   __global const Point* pts, 
                                            __global const Ellipse* ellipses, 
                                            __global int* output)
{
    int polygonIdx = get_global_id(0);
    int ellipseIdx = get_global_id(1);

    if (<<isIntersects>>) {
        output[ellipseIdx] = 1;
    }
}

The problem is that once one of the polygons intersects the ellipse i don't need to calculate the rest of the polygons.

I've tried to check output[ellipseIdx] != 0 before intersection test, but the performance didn't change so much.

I've tried to do single dim global - giving the 1000 (ellipses) and in the kernel run over the millions polygons and stop when i've found one, but still not so much change.

Am i doing something wrong ? can i speed this operation up ? any tips ?

EDIT

Using the tip from @Moises and doing allot of research I've change my code to run 2 million times, single dimension. using group work items. changed all my structs to native types, skipped modulus operation. And basically where i could copy data from the global to the private/local memory, i did it.

My local size is my device CL_DEVICE_MAX_WORK_GROUP_SIZE, in my cpu&gpu it is 1024, so in a single run i'm covering all my 1000 ellipses.

Host side

size_t global = 1999872; // 2 million divided by 1024 - for the test
size_t local = 1024;

My code look like this now

 __kernel void polygonsIntersectsEllipses(  __global const float4* pts, 
                                            __global const float4* ellipses, 
                                            int ellipseCount,
                                            __local float4* localEllipses,
                                            __global int* output)
{
    // Saving the ellipses to local memory
    int localId = get_local_id(0);
    if (localId < eCount)
        localEllipses[localId] = ellipses[localId];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Saving the current polygon into private memory
    int polygonIdx = get_global_id(0);
    float2 private_pts[5];
    for (int currPtsIdx = 0; currPtsIdx < 4; currPtsIdx++)
    {
        private_pts[currPtsIdx] = pts[polygonIdx * 4 + currPtsIdx];
    }

    // saving the last point as first so i will not use modulus for cycling, in the intersection algorithm
    private_pts[4] = private_pts[0];

    // Run over all the ellipse in the local memory including checking if already there is an output
    for (int ellipseIdx = 0; ellipseIdx < ellipseCount && output[ellipseIdx] == 0; ++ellipseIdx) {
       if (<<isIntersects Using localEllipses array and private_pts>>) {
           output[ellipseIdx] = 1;
       }
    }
}

The results

CPU was not so much improved - 1.1 faster after change.

GPU - 6.5 times faster (i'm thrilled)

Is there any place i can improve even more ? Reminder, once one of the polygons intersects the ellipse, we don't need to check with the rest of the polygons. How do i do that ? my trick with asking the output value is not really working - the performances are the same with or without it

Moises Viñas Moises Viñas · Accepted Answer · 2015-12-30T15:06:59

I understand that all of your 2millionx1000 threads, read its own polygon data and ellipse right? So, for each polygon, each thread reads 1000 times the same memory position (with the polygon data), doesn't it? To avoid this memory bound behavior, you can create only 2million threads and using a for loop of 1000 iterations to iterate over the number of ellipses. Or an intermediate solution, having a mesh of 2millionx64 threads where each thread computes 16 ellipses for each polygon. I do not know if these are better than your solution but they avoid the superfluous memory accesses.

Regards, Moisés

Prevent calculation twice in opencl

2 Answers