How to execute atomic write in CUDA?

Question

First of all, I cannot find a reliable source stating whether a write is atomic in CUDA or not. For example, "Is global memory write considered atomic in CUDA?" touches on this subject, but the last remark there shows we are not talking about the same notion of atomicity. Given the code:

global_mem[0] = pick_at_random_from(1, 2);
shared_mem[0] = pick_at_random_from(1, 2);

executed by a gazillion threads, "atomic" means that in both cases the content will be 1 or 2, and it is guaranteed that nothing else can show up (like 3). Atomic means integrity.

But as I understand it, CUDA does not guarantee this, so when I run this code I could potentially get the value 3? If that is really the case, how do I perform an atomic write? There is atomicExch, but it is overkill: it does more than is needed.

Atomic functions I already checked: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

Atomic operations are, as the documentation says, "read-modify-write operations" in CUDA. The definition used for CUDA is "The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads". I think (not 100% sure) that you are guaranteed to get 1 or 2 in the code you showed; you just do not know which thread wrote it, due to race conditions. - Ander Biguri
@AnderBiguri, are you quoting the part I linked? If so, the beginning of that sentence talks about functions, not operations, so I believe "read-modify-write" should be read as a sequence, not a pool, and that they are referring to the functions listed below it (in the doc). - astrowalker
No, you can't get 3. You will get either 1 or 2, assuming the writes you are doing are to a consistent location and naturally aligned across threads. This has been covered elsewhere (multiple questions here on the cuda tag, such as this one). Your question is maybe a duplicate of that one. - Robert Crovella
If you want a formal statement of the memory consistency model in CUDA, as opposed to my claims, you would need to parse through the memory model definition given in the PTX manual. - Robert Crovella
@RobertCrovella, thank you, but I checked your answers: in one you write that writes are atomic, and in another you write that writes are NOT atomic. First, a write cannot be atomic and not atomic at the same time; second, with such a contradiction I still don't know whether they are atomic :-) - astrowalker

Robert Crovella · Accepted Answer · 2018-10-18T15:20:18

For a write operation in each of 2 different threads in CUDA, if:

the writes are to the same location (address)
that address is naturally aligned for the size of the write
the size of the write operation is the same between each of the two threads (and is of size 1, 2, 4, or 8 bytes)

then you are guaranteed to get one of the values written by those two threads, and not any other value, considering the data type size that was written. This is provided so long as the write is done by a single SASS instruction. The correctness here is provided by current CUDA hardware, not necessarily the compiler, the CUDA programming model, and/or the C++ standard to which CUDA adheres.

This is directly extendable to any number of threads that meet the above conditions.

This assumes no other threads are doing "anything else" with respect to the written locations (i.e. they are not writing a different size quantity to that location, or any overlapping location, or of some other alignment).
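As a minimal sketch of the conditions above (not code from the original question), the kernel below has every thread store either 1 or 2 into the same naturally aligned 4-byte int using a plain assignment. The names racy_store and run_racy_store are hypothetical, and the parity test on the thread index is only a stand-in for pick_at_random_from(1, 2).

```cuda
#include <cuda_runtime.h>

// Every thread stores 1 or 2 to the same naturally aligned 4-byte location.
__global__ void racy_store(int *slot)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Hypothetical stand-in for pick_at_random_from(1, 2).
    int value = (tid & 1) ? 1 : 2;
    // Plain store, no atomic function involved.
    *slot = value;
}

// Hypothetical host helper: returns the value left in the slot,
// or -1 if a CUDA call failed (for example, no device available).
int run_racy_store()
{
    int *d_slot = nullptr;
    if (cudaMalloc(&d_slot, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_slot, 0, sizeof(int));

    racy_store<<<256, 256>>>(d_slot);

    int h_value = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&h_value, d_slot, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_slot);
    return err == cudaSuccess ? h_value : -1;
}
```

Under the stated conditions, the helper should return 1 or 2 on current hardware, but which of the two is not defined.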

Which actual value will end up in that location is generally undefined (except that it will be one and only one of the written values, and not anything else) unless the programmer enforces some ordering on the operations.
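One possible way to enforce an ordering, offered here as an illustration rather than as the answer's recommendation, is to rely on stream order: kernels launched into the same stream run in issue order, so the stores of the second launch land after those of the first. The names store_value and run_ordered_store are hypothetical.

```cuda
#include <cuda_runtime.h>

// Every thread in the launch stores the same value to the same location.
__global__ void store_value(int *slot, int value)
{
    *slot = value;
}

// Hypothetical host helper: returns the value left in the slot,
// or -1 if a CUDA call failed.
int run_ordered_store()
{
    int *d_slot = nullptr;
    if (cudaMalloc(&d_slot, sizeof(int)) != cudaSuccess) return -1;

    // Both launches go to the default stream, so the second kernel
    // starts only after the first one has completed.
    store_value<<<64, 128>>>(d_slot, 1);
    store_value<<<64, 128>>>(d_slot, 2);

    int h_value = 0;
    cudaError_t err = cudaMemcpy(&h_value, d_slot, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_slot);
    return err == cudaSuccess ? h_value : -1;
}
```

Within a single launch the threads still race with each other, but since they all write the same value here, the final content is determined by the launch order alone.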

When writing vector quantities or structures in C/C++, care should be taken to ensure that the underlying write (store) instruction in SASS code references the appropriate size. The comments above when referring to write operations are referring to the writes as issued by the SASS code. Generally speaking, I don't expect much difference between that interpretation and "writes from C/C++ code" using POD data types. But structures could possibly be broken into multiple transactions of a smaller size, in which case the above statements would be abrogated. Nevertheless, it's possible with appropriate programming practices (e.g. careful use of vector types) in C/C++ to ensure that up to 8 byte writes will be used if relevant.
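A hedged sketch of the "careful use of vector types" idea: int2 is a built-in CUDA vector type that is 8 bytes wide and 8-byte aligned, so a whole-value assignment to it is a natural candidate for a single 8-byte store. Whether the compiler actually emits one store instruction is an assumption that should be confirmed by inspecting the SASS (for example with cuobjdump -sass). The names store_pair and run_store_pair are hypothetical.

```cuda
#include <cuda_runtime.h>

// int2 is declared with 8-byte size and alignment in the CUDA headers.
static_assert(sizeof(int2) == 8, "int2 is expected to be 8 bytes");
static_assert(alignof(int2) == 8, "int2 is expected to be 8-byte aligned");

// Each thread writes a self-consistent pair (k, -k) with k in {1, 2}.
__global__ void store_pair(int2 *slot)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int k = (tid & 1) ? 1 : 2;
    // Assign the whole vector at once rather than .x and .y separately,
    // so the compiler can use one 8-byte store.
    *slot = make_int2(k, -k);
}

// Hypothetical host helper: returns 1 if the pair read back is one of the
// written pairs, 0 if it looks torn, and -1 if a CUDA call failed.
int run_store_pair()
{
    int2 *d_slot = nullptr;
    if (cudaMalloc(&d_slot, sizeof(int2)) != cudaSuccess) return -1;

    store_pair<<<256, 256>>>(d_slot);

    int2 h_pair = make_int2(0, 0);
    cudaError_t err = cudaMemcpy(&h_pair, d_slot, sizeof(int2),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_slot);
    if (err != cudaSuccess) return -1;

    bool consistent = (h_pair.y == -h_pair.x) &&
                      (h_pair.x == 1 || h_pair.x == 2);
    return consistent ? 1 : 0;
}
```

By contrast, a user-defined struct of two ints is only 4-byte aligned by default, so a store to it may be split into two 4-byte stores, which is the situation the paragraph above warns about.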
