4
votes

I am currently developing a GPU-based program that uses multiple kernels launched concurrently via multiple streams.

In my application, multiple kernels need to access a queue/stack, and I plan to use atomic operations.

However, I do not know whether atomic operations work correctly between concurrently launched kernels. I would appreciate help from anyone who knows the exact mechanism of atomic operations on the GPU or who has experience with this issue.

1
Have you tried anything? This isn't my area at all, but if you've attempted something, other users will be able to help better with some code :) – Andy Holmes
From the CUDA C Programming Guide: "An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory. [...] The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads." From the CUDA C Best Practices Guide: "On devices that are capable of concurrent kernel execution, streams can also be used to execute multiple kernels simultaneously to more fully take advantage of the device's multiprocessors." – Vitality
Putting together the two things, I would say that, if you launch different kernels working on independent data in different streams and each kernel uses atomic operations, then the operations within each kernel may be "serialized", but the kernels can still run concurrently, exploiting the different resources available in a GPU (cores, Load/Store units, and Special Function Units). – Vitality

1 Answer

6
votes

Atomics are implemented in the GPU's L2 cache hardware, through which all global memory operations must pass. There is no hardware to ensure coherence between host and device memory, or between different GPUs; but as long as the kernels are running on the same GPU and synchronizing through device memory on that GPU, atomics will work as expected.
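To illustrate, here is a minimal sketch (kernel and variable names such as `push_kernel`, `d_stack`, and `d_top` are my own, not from the question) of two kernels launched in different streams that push onto a single stack in device memory. Each thread claims a unique slot with `atomicAdd` on a shared top-of-stack counter, which is exactly the cross-kernel atomicity being asked about:

```cuda
#include <cstdio>

__device__ int d_top;          // top-of-stack counter shared by all kernels
__device__ int d_stack[1024];  // stack storage in device (global) memory

__global__ void push_kernel(int base) {
    // atomicAdd reserves a unique slot even if another kernel,
    // launched in a different stream, is pushing at the same time.
    int slot = atomicAdd(&d_top, 1);
    if (slot < 1024)
        d_stack[slot] = base + threadIdx.x;
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // On devices supporting concurrent kernel execution, these two
    // launches may overlap, contending on the same atomic counter.
    push_kernel<<<1, 128, 0, s1>>>(0);
    push_kernel<<<1, 128, 0, s2>>>(1000);
    cudaDeviceSynchronize();

    int top;
    cudaMemcpyFromSymbol(&top, d_top, sizeof(int));
    printf("total pushes: %d\n", top);  // 256: each thread claimed exactly one slot

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```

Whether the kernels actually overlap depends on the device and resource usage, but correctness does not: the L2-resident atomic guarantees each `atomicAdd` returns a distinct slot regardless of scheduling.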