
I have been trying to implement a simple parallel algorithm using the OpenCL C++ bindings (version 1.2). Roughly, here is the C code (no OpenCL):

typedef struct coord{
    double _x;
    double _y;
    double _z;
}__coord;

typedef struct node{
    __coord _coord;
    double _dist;
} __node;
double input[3] = {-1.0, -2.0, 3.5};
// nodeVector1D is a 1-D random array of struct __node
// nodeVectorSize is the size of the above array (> 1,000)
double d = 0.0;
for (int i = 0; i < nodeVectorSize; i++) {
    __node n = nodeVector1D[i];
    d += (input[0] - n._coord._x)*(input[0] - n._coord._x);
    d += (input[1] - n._coord._y)*(input[1] - n._coord._y);
    d += (input[2] - n._coord._z)*(input[2] - n._coord._z);
    n._dist = d;
}
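As a commenter points out below, the loop as written accumulates d across iterations and assigns into a local copy of the struct. A minimal sketch of what it presumably intends (resetting d per node and writing back through the array) might look like this, with the names carried over from the post:

```cpp
#include <cstddef>

struct __coord { double _x, _y, _z; };
struct __node  { __coord _coord; double _dist; };

// For each node, compute the squared Euclidean distance to `input`
// and store it in the node itself (one independent result per node,
// which is what the parallel kernel computes per work item).
void computeDist(__node* nodeVector1D, std::size_t nodeVectorSize,
                 const double input[3]) {
    for (std::size_t i = 0; i < nodeVectorSize; ++i) {
        __node& n = nodeVector1D[i];   // reference, not a copy
        double d = 0.0;                // reset per node
        d += (input[0] - n._coord._x) * (input[0] - n._coord._x);
        d += (input[1] - n._coord._y) * (input[1] - n._coord._y);
        d += (input[2] - n._coord._z) * (input[2] - n._coord._z);
        n._dist = d;
    }
}
```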

I use a MacBook Pro 13" Late 2013, running Mac OS X Lion. OpenCL only detects the CPU: an Intel Ivy Bridge i5 at 2.6 GHz, which has an integrated Intel HD Graphics 4000 GPU (1 GB at 1.6 GHz). The maximum detected work group size is 1024. When I run the flat code above (with 1024 nodes), it takes around 17 microseconds.

When I run its parallel version using the OpenCL C++ bindings, it takes 10 times as long, around 87 microseconds (excluding program creation, buffer allocation, and writing). What am I doing wrong here?

NB: the OpenCL kernel for this algorithm is easy to guess, but I can post it if needed. Thanks in advance.

EDIT #1: THE KERNEL CODE

__kernel void _computeDist(
    __global void* nodeVector1D,
    const unsigned int nodeVectorSize,
    const unsigned int itemsize,
    __global const double* input){
    int i = get_global_id(0);
    if (i >= nodeVectorSize) return;
    /* byte-offset to the i-th struct; note the pointer must keep the
       __global address space qualifier, and the offset is computed on a
       char pointer since arithmetic on void* is not valid OpenCL C */
    __global double* n = (__global double*)((__global char*)nodeVector1D + i*itemsize);
    double d = 0.0;
    for (int c = 0; c < 3; c++){
        d += (input[c] - n[c])*(input[c] - n[c]);
    }
    n[3] = d;
}

Sorry for the byte-level pointer arithmetic, but it works (no segfault). I can also post the OpenCL initialization routine, but I think it's all over the Internet. I will post it if someone asks.
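For what it's worth, the byte-offset arithmetic can be avoided entirely by declaring the struct in the kernel source and indexing the array directly. This is only a sketch (untested), and it assumes the host and device agree on the struct layout (no padding differences):

```c
/* OpenCL C sketch: __node must have the same layout on host and device;
   doubles require the cl_khr_fp64 extension on OpenCL 1.2 devices. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef struct { double _x, _y, _z; } __coord;
typedef struct { __coord _coord; double _dist; } __node;

__kernel void _computeDist(__global __node* nodeVector1D,
                           const unsigned int nodeVectorSize,
                           __global const double* input){
    int i = get_global_id(0);
    if (i >= nodeVectorSize) return;
    __global __node* n = &nodeVector1D[i];
    double d = 0.0;
    d += (input[0] - n->_coord._x)*(input[0] - n->_coord._x);
    d += (input[1] - n->_coord._y)*(input[1] - n->_coord._y);
    d += (input[2] - n->_coord._z)*(input[2] - n->_coord._z);
    n->_dist = d;
}
```

With this variant the itemsize parameter becomes unnecessary, since the compiler knows sizeof(__node).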

@pmdj: As I said above, OpenCL recognizes my CPU; otherwise I wouldn't have been able to run the tests and get the performance results presented above.

@pmdj: OpenCL kernel code, to my knowledge, is always written in C. However, I tagged C++ because, as I said above, I'm using the OpenCL C++ bindings.

Perhaps you should share the code you're having trouble with instead of the "C++ version". - Havenard
10 x 17 is 170, not 87. Is this performance critical code? If so, you'll need to give a minimum, complete example so we can reproduce the problem. - tadman
A few things: (1) The Intel HD Graphics 4000 is supported by the macOS OpenCL implementation, so if it's not showing up in your device enumeration, you've got a problem with your code right there. You haven't shared your code so there's not much I can say beyond that. Note that OS X 10.7 is long out of support and has various security issues so you really shouldn't be using it anymore. (2) If your data set is only a one-off calculation with 1000 items, that's probably too small to see the advantages of OpenCL. Your startup and coordination overhead will exceed your actual algorithm runtime. - pmdj
(3) Your kernel code isn't actually equivalent to the C++ code: surely you want input[c] - n[c], not input[0] - n[0] 3 times in a row. - pmdj

1 Answer


I finally found the issue. The problem was that OpenCL on Mac OS X reports a misleading maximum device work group size of 1024. I tested various work group sizes and ended up with a 200% performance gain when using 128 work items per group. Here is a clearer benchmark picture; IGPU stands for integrated GPU. (X-axis: the array size; Y-axis: the duration in microseconds.)

(benchmark chart omitted)
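To apply an explicit local size of 128 from the host, the global size must be a multiple of it (OpenCL 1.2 requires this), so the global size is rounded up and the kernel's existing bounds check discards the padded work items. The rounding helper is plain arithmetic; the enqueue call in the comment is a sketch with hypothetical queue/kernel object names:

```cpp
#include <cstddef>

// Round `globalSize` up to the next multiple of `localSize`, so the
// global NDRange is evenly divisible by the work group size. Extra
// work items fall through the kernel's `if (i >= nodeVectorSize)` guard.
std::size_t roundUp(std::size_t globalSize, std::size_t localSize) {
    return ((globalSize + localSize - 1) / localSize) * localSize;
}

// With the C++ bindings the enqueue would then look roughly like:
//   queue.enqueueNDRangeKernel(kernel, cl::NullRange,
//                              cl::NDRange(roundUp(nodeVectorSize, 128)),
//                              cl::NDRange(128));
```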