I am trying to use atomic functions in my OpenCL kernel. The multiple threads I create try to write to a single memory location in parallel, and I want them to execute that particular line of code serially. I have never used an atomic function before.

I found similar problems on many blogs and forums, and I am trying one of the suggested solutions, i.e. using two functions, 'acquire' and 'release', to lock and unlock a semaphore. I have included the necessary OpenCL extensions, all of which are supported by my device (NVIDIA GeForce GTX 630M).

My kernel execution configuration:

global_item_size = 8;
ret = clEnqueueNDRangeKernel(command_queue2, kernel2, 1, NULL, &global_item_size2, &local_item_size2, 0, NULL, NULL);

Here is my code: reducer.cl

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

typedef struct data
{
  double dattr[10];
  int d_id;
  int bestCent;
}Data;

typedef struct cent
{
  double cattr[5];
  int c_id;
}Cent;

void acquire(__global int* mutex)
{
    int occupied;
    do {
        occupied = atom_xchg(mutex, 1);
    } while (occupied>0);
}

void release(__global int* mutex)
{
    atom_xchg(mutex, 0); //the previous value, which is returned, is ignored
}

__kernel void reducer(__global int *keyMobj, __global int *valueMobj,__global Data *dataMobj,__global Cent *centMobj,__global int *countMobj,__global double *sumMobj, __global int *mutex)
{
  __local double sum[2][2];
  __local int cnt[2];

  int i = get_global_id(0);
  int n,j;

  if(i<2)
    cnt[i] = countMobj[i];
  barrier(CLK_GLOBAL_MEM_FENCE);

  n = keyMobj[i];
  for(j=0; j<2; j++)
  {
    barrier(CLK_GLOBAL_MEM_FENCE);
    acquire(mutex);
    sum[n][j] += dataMobj[i].dattr[j];
    release(mutex);
  }

  if(i<2)
  {
    for(j=0; j<2; j++)
    {
       sum[i][j] = sum[i][j]/countMobj[i];
       centMobj[i].cattr[j] = sum[i][j];
    }
  }
}

Unfortunately, this solution doesn't seem to work for me. When I read centMobj and sumMobj back into host memory using

ret = clEnqueueReadBuffer(command_queue2, centMobj, CL_TRUE, 0, (sizeof(Cent) * 2),  centNode, 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue2, sumMobj, CL_TRUE, 0, (sizeof(double) * 2 * 2), sum, 0, NULL, NULL);

both calls fail with error code -5 (CL_OUT_OF_RESOURCES).

I can't tell whether the problem is in my atomic function code or in reading the data back into host memory. If I am using the atomic functions incorrectly, please correct me. Thank you in advance.

1 Answer

In OpenCL, work-items can synchronize with each other (for example with barrier()) only within a single work-group. Code that tries to synchronize work-items across different work-groups, such as a spin lock on a global mutex, may happen to work in some very specific, implementation- and device-dependent cases, but it will fail in the general case.
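
As a rough illustration of synchronization that stays inside one work-group, here is a minimal sketch of a per-group sum. The kernel name `group_sum` and its arguments are hypothetical and not taken from the question, and the sketch assumes the work-group size is a power of two.

```c
// Hypothetical kernel: every work-group sums its own slice of `in`.
// Assumes the local size is a power of two and that `scratch` holds
// one int per work-item of the group.
__kernel void group_sum(__global const int *in,
                        __global int *group_out,
                        __local int *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    // Each work-item stages one value in local memory.
    scratch[lid] = in[get_global_id(0)];

    // Waits only for the work-items of THIS work-group.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction; the barrier is reached by every work-item of the
    // group in every iteration because the loop bounds are uniform.
    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One result per work-group; combining the groups is left to the
    // host or to a second kernel launch.
    if (lid == 0)
        group_out[get_group_id(0)] = scratch[0];
}
```

The barrier never has to wait for another work-group, which is why this pattern is portable while a cross-group lock is not.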

The solution is either to use atomics to serialize accesses to the same memory location (without blocking any work-item), or to redesign the code.
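
One possible shape for the non-blocking variant is a compare-and-swap loop, sketched below for accumulating doubles. This is an assumption-laden sketch, not a drop-in fix: the helper `atomic_add_double` and the kernel `accumulate` are hypothetical names, the device is assumed to support `cl_khr_int64_base_atomics` (not confirmed for the GPU in the question), and `sums` is assumed to be zeroed by the host before launch.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Hypothetical helper: lock-free "*addr += val" on a double in global
// memory. No work-item ever holds a lock; a work-item that loses the
// race simply re-reads the value and retries.
void atomic_add_double(volatile __global double *addr, double val)
{
    union { ulong u; double d; } expected, desired;
    do {
        expected.d = *addr;               // current value
        desired.d  = expected.d + val;    // value we want to store
        // Store `desired` only if the location still holds `expected`;
        // atom_cmpxchg returns the value it found there.
    } while (atom_cmpxchg((volatile __global ulong *)addr,
                          expected.u, desired.u) != expected.u);
}

// Hypothetical kernel: adds values[i] to the running sum of its key.
__kernel void accumulate(__global const int *keys,
                         __global const double *values,
                         __global double *sums)
{
    size_t i = get_global_id(0);
    atomic_add_double(&sums[keys[i]], values[i]);
}
```

Unlike the acquire/release pair, each retry here depends only on some other work-item having made progress, so it does not rely on how work-items or work-groups are scheduled. Dividing the sums by the counts would then happen in a separate kernel launch or on the host, once all additions have finished.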