18
votes

I have several blocks, each having some integers in a shared memory array of size 512. How can I check if the array in every block contains a zero as an element?

What I am doing is creating an array that resides in the global memory. The size of this array depends on the number of blocks, and it is initialized to 0. Hence every block writes to a[blockid] = 1 if the shared memory array contains a zero.

My problem is when I have several threads in a single block writing at the same time. That is, if the array in the shared memory contains more than one zero, then several threads will write a[blockid] = 1. Would this generate any problem?

In other words, would it be a problem if 2 threads write the exact same value to the exact same array element in global memory?

4

4 Answers

23
votes

For a CUDA program, if multiple threads in a warp write to the same location then the location will be updated but it is undefined how many times the location is updated (i.e. how many actual writes occur in series) and it is undefined which thread will write last (i.e. which thread will win the race).

For devices of compute capability 2.x, if multiple threads in a warp write to the same address then only one thread will actually perform the write, which thread is undefined.

From the CUDA C Programming Guide section F.4.2:

If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.

See also section 4.1 of the guide for more info.

In other words, if all threads writing to a given location write the same value, then it is safe.

14
votes

In the CUDA execution model, there are no guarantees that every simultaneous write from threads in the same block to the same global memory location will succeed. At least one write will work, but it isn't guaranteed by the programming model how many write transactions will occur, or in what order they will occur if more than one transaction is executed.

If this is a problem, then a better approach (from a correctness point of view), would be to have only one thread from each block do the global write. You can either use a shared memory flag set atomically or a reduction operation to determine whether the value should be set. Which you choose might depend on how many zeros there are likely to be. The more zeroes there are, the more attractive the reduction will be. CUDA includes warp level __any() and __all() operators which can be built into a very efficient boolean reduction in a few lines of code.

0
votes

Yes, it will be a problem called as Race Condition.
You should consider synchronizing access to the global data through process Semaphores

0
votes

While not a mutex or semaphore, CUDA does contain a synchronization primative you can utilize for serializing access to a given code segment or memory location. Through the __syncthreads() function, you can create a barrier so that any given thread blocks at the point of the command call until all the threads in a given block have executed the __syncthreads() command. That way you can hopefully serialize access to your memory location and avoid a situation where two threads need to write to the same memory location at the same time. The only warning is that all the threads have to at some point execute __syncthreads(), or else you will end up with a dead-lock situation. So don't place the call inside some conditional if-statement where some threads may never execute the command. If you do approach your problem like this, there will need to be some provision made for the threads that don't initially call __syncthreads() to call the function later in order to avoid deadlock.