2
votes

How can I write a statement in my CUDA kernel that is executed by a single thread. For example if I have the following kernel:

__global__ void Kernel(bool *d_over, bool *d_update_flag_threads, int no_nodes)
{
   int tid = blockIdx.x*blockDim.x + threadIdx.x;
   if( tid<no_nodes && d_update_flag_threads[tid])
   {
     ...
     *d_over=true; // writing a single memory location, only 1 thread should do?
     ...
   }
}

In above kernel, "d_over" is a single boolean flag while "d_update_flag_threads" is a boolean array.

What I normally did before is using the first thread in the thread block e.g.:

if(threadIdx.x==0)

but It could not work in this case as I have a flag array here and only threads with assosiated flag "true" will execute the if statement. That flag array is set by another CUDA kernel called before and I don't have any knowledge about it in advance.

In short, I need something similar to "Single" construct in OpenMP.

2
Why not simply terminate the current if, create a new if(threadIdx.x == 0) for the assignment, and then resume control with a new if? - Jared Hoberock

2 Answers

3
votes

A possible approach is use atomic operations. If you need only one thread per block to do the update, you could do the atomic operation in shared memory (for compute capability >= 1.2) which is generally much faster than perform it in global memory.

Said that, the idea is as follow:

int tid = blockIdx.x*blockDim.x + threadIdx.x;

__shared__ int sFlag;
// initialize flag
if (threadIdx.x == 0) sFlag = 0;
__syncthreads();

if( tid<no_nodes && d_update_flag_threads[tid])
{
  // safely update the flag
  int singleFlag = atomicAdd(&sFlag, 1);
  // custom single operation
  if ( singleFlag == 0) 
      *d_over=true; // writing a single memory location, only 1 thread will do it
       ...
}

It is just an idea. I've not tested it but is close to an operation performed by a single thread, not being the first thread of the block.

0
votes

You could use atomicCAS(d_over, 0, 1) where d_over is declared or type-cast as int*. This would ensure that only the first thread that sees the d_over value to be 0 (false) would update it and nobody else would.