Efficiently find minimum of large array using Opencl

Question

I am working on the implementation of a hierarchical clustering algorithm in opencl. For each step, I have find the minimum value in a very large array (approx. 10^8 entries) so that I know which elements have to be combined into a new cluster. The identification of the minimum value must be done 9999 times. With my current kernels, it takes about 200 seconds to find the minimum value (accumulated over all iterations). How I approached the problem is by dividing the array into 2560 equally sized fragments (there are 2560 stream processors on my Radeon 7970) and to find the minimum of each fragment individually. The I run a second kernel that combines these minima into a global minimum.

It there any more efficient way to approach this problem? The initial idea was to speed up HCA by using OpenCL but with the amount of time that the identification of the minimum takes it is much longer than the matlab HCA on the CPU. What am I doing wrong?

__kernel void findMinValue(__global float * myArray, __global double * mins, __global int * elementsToWorkOn, __global int * arraysize){
int gid = get_global_id(0);
int minloc = 0;
float mymin = INFINITY;
int eltoWorkOn = *elementsToWorkOn;
int offset = gid*eltoWorkOn;
int target = offset + eltoWorkOn;

if (offset<*arraysize){
    //make sure the array size is not exceeded
    if (target > *arraysize){
        target = *arraysize;
    }

    //find minimum for the kernel
    for (int i = offset; i < target; i++){
        if (*(myArray + i) < mymin){
            mymin = *(myArray + i);
            minloc = i;
        }
    }
}
*(mins + gid * 2) = minloc;
*(mins + gid * 2 + 1) = mymin;
}


__kernel void getGlobalMin(__global double * mins, __global double * gmin, __global int * pixelsInImage){
    int nWorkitems = 2560;
    float globalMin = INFINITY;
    double globalMinLoc;
    float tempMin;
    for (int i = 0; i < nWorkitems; i++){
        tempMin = *(mins + 2 * i + 1);
        if (tempMin < globalMin){
            globalMin = tempMin;
            globalMinLoc = *(mins + 2 * i);
        }
    }
    *(gmin + 0) = globalMinLoc;
    *(gmin + 1) = globalMin;
}

UPDATE

I redesigned the findMinValue Kernel based on your suggestions. The memory access is now coalescent and I divided the work into work groups, so that I can reduce the amount of global memory accesses. Before, every kernel wrote its minimum value to the global mins buffer. Now only one kernel per worg group writes one value (i.e. the group minimum). Furthermore, I increased the global work size in order to hide memory latency.

These changes allowed to reduce the time required for identifying the minima from >200s to only 59s! Thank you very much for your help!

Is there anything else I could have missed while optimizing the kernel? Do you have any further suggestions? I could not figure out how to use setArg(). Do I have to pass a pointer to the int value to it (like this: err = clSetKernelArg(kernel[2], 3, sizeof(int), &variable);). How woudl the kernel declaration look in this case?

Here is my new Kernel:

__kernel void findMinValue(__global float * myArray, __global double * mins, __global int * arraysize,__global int * elToWorkOn,__global int * dummy){
int gid = get_global_id(0);
int lid = get_local_id(0);
int groupID = get_group_id(0);
int lsize = get_local_size(0);
int gsize = get_global_id(0);
int minloc = 0;
int arrSize = *arraysize;
int elPerGroup = *elToWorkOn;
float mymin = INFINITY;


__local float lmins[128];
//initialize local memory
*(lmins + lid) = INFINITY;
__local int lminlocs[128];

//this private value will reduce global memory access in the for loop (temp = *(myArray + i);)
float temp;

//ofset and target of the for loop
int offset = elPerGroup*groupID + lid;
int target = elPerGroup*(groupID + 1);

//prevent that target<arrsize (may happen due to rounding errors or arrSize not a multiple of elPerGroup
target = min(arrSize, target);

//find minimum for the kernel
//offset is different for each lid, leading to sequential memory access
if (offset < arrSize){
    for (int i = offset; i < target; i += lsize){
        temp = *(myArray + i);
        if (temp < mymin){
            mymin = temp;
            minloc = i;
        }
    }

    //store kernel minimum in local memory
    *(lminlocs + lid) = minloc;
    *(lmins + lid) = mymin;

    //find work group minimum (reduce global memory accesses)
    lsize = lsize >> 1;
    while (lsize > 0){
        if (lid < lsize){
            if (*(lmins + lid)> *(lmins + lid + lsize)){
                *(lmins + lid) = *(lmins + lid + lsize);
                *(lminlocs + lid) = *(lminlocs + lid + lsize);
            }
        }
        lsize = lsize >> 1;
    }
}
//write group minimum to global buffer
if (lid == 0){
    *(mins + groupID * 2 + 0) = *(lminlocs + 0);
    *(mins + groupID * 2 + 1) = *(lmins + 0);
}
}

What kind of changes does the array experience between each search for the minimum? A solution may involve bookkeeping those changes instead of starting each search anew. — Drew Dormann
I already thought about that but didn't come up with a smart solution. In fact, only two rows and columns are changed with each iteration. The matrix itself is 10000x10000 items large. If you know any good method to avoid having to iterate over every single item with each iteration I would be very happy. This would also be very helpful because the distance matrix will contain more and more INFINITY values the more iterations have passed and these values do not need to be considered at all. I couldn't figure out a smart way to identify the global minimum and its position :-/ — mTORjaeger

Dithermaster Dithermaster · Accepted Answer · 2014-06-17T19:46:32

If each work item iterates through a global array there is ZERO coalescing of reads. If you change it so each work items strides by the warp or wavefront size then you'd get a huge speed gain.

Efficiently find minimum of large array using Opencl

3 Answers