I'm pretty new in ocl programming.
I have 2 million polygons (4 lines each - for this examples - but it is variant) that i need to find intersection with 1000 ellipses.
I need to know which if each ellipse intersected at least with one of the polygons.
For that i've created a poylgons buffer with all the points, and ellipses buffer.
My output buffer is a 1000 items int set to 0 in all items. and the kernel will set 1 (when he found intersection) in the right index according to the ellipse index.
I ran the kernel with a global - 2 dim, {2million, 1000}.
__kernel void polygonsIntersectsEllipses( __global const Point* pts,
__global const Ellipse* ellipses,
__global int* output)
{
int polygonIdx = get_global_id(0);
int ellipseIdx = get_global_id(1);
if (<<isIntersects>>) {
output[ellipseIdx] = 1;
}
}
The problem is that once one of the polygons intersects the ellipse i don't need to calculate the rest of the polygons.
I've tried to check output[ellipseIdx] != 0 before intersection test, but the performance didn't change so much.
I've tried to do single dim global - giving the 1000 (ellipses) and in the kernel run over the millions polygons and stop when i've found one, but still not so much change.
Am i doing something wrong ? can i speed this operation up ? any tips ?
EDIT
Using the tip from @Moises and doing allot of research I've change my code to run 2 million times, single dimension. using group work items. changed all my structs to native types, skipped modulus operation. And basically where i could copy data from the global to the private/local memory, i did it.
My local size is my device CL_DEVICE_MAX_WORK_GROUP_SIZE, in my cpu&gpu it is 1024, so in a single run i'm covering all my 1000 ellipses.
Host side
size_t global = 1999872; // 2 million divided by 1024 - for the test
size_t local = 1024;
My code look like this now
__kernel void polygonsIntersectsEllipses( __global const float4* pts,
__global const float4* ellipses,
int ellipseCount,
__local float4* localEllipses,
__global int* output)
{
// Saving the ellipses to local memory
int localId = get_local_id(0);
if (localId < eCount)
localEllipses[localId] = ellipses[localId];
barrier(CLK_LOCAL_MEM_FENCE);
// Saving the current polygon into private memory
int polygonIdx = get_global_id(0);
float2 private_pts[5];
for (int currPtsIdx = 0; currPtsIdx < 4; currPtsIdx++)
{
private_pts[currPtsIdx] = pts[polygonIdx * 4 + currPtsIdx];
}
// saving the last point as first so i will not use modulus for cycling, in the intersection algorithm
private_pts[4] = private_pts[0];
// Run over all the ellipse in the local memory including checking if already there is an output
for (int ellipseIdx = 0; ellipseIdx < ellipseCount && output[ellipseIdx] == 0; ++ellipseIdx) {
if (<<isIntersects Using localEllipses array and private_pts>>) {
output[ellipseIdx] = 1;
}
}
}
The results
CPU was not so much improved - 1.1 faster after change.
GPU - 6.5 times faster (i'm thrilled)
Is there any place i can improve even more ? Reminder, once one of the polygons intersects the ellipse, we don't need to check with the rest of the polygons. How do i do that ? my trick with asking the output value is not really working - the performances are the same with or without it