
I was reading something about the memory model in CUDA. In particular, when copying data from global to shared memory, my understanding of shared_mem_data[i] = global_mem_data[i] is that it is done in a coalesced, atomic fashion, i.e. each thread in the warp reads global_mem_data[i] in a single indivisible transaction. Is that correct?

CUDA makes no statements about the order of thread execution. Therefore you should not assume any ordering between what is read by one thread and what is read by another, even if they are in the same warp. With respect to the atomicity of a single thread reading, say, a properly aligned multibyte quantity, those bytes should be coherent, even if they were written by another thread. See here, which includes a link to the specific point in the hardware memory model doc supporting this claim. - Robert Crovella
In general, CUDA makes no statements about the order of thread execution. There may be a few exceptions, such as in the case of warp collective intrinsics that specify a sync mask. And I haven't tried to capture every idea from the other answer I linked here in these comments. Please read it for a more complete description. - Robert Crovella

1 Answer


tl;dr: No.

It is not guaranteed, AFAIK, that all values are read in a single transaction. In fact, a GPU's memory bus is not even guaranteed to be wide enough for a single transaction to retrieve a full warp's worth of data (1024 bits for a full warp reading 4 bytes per thread). It is therefore theoretically possible for some of the read-from locations in memory to change while the read is underway.
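To make this concrete, here is a minimal kernel sketch of the copy pattern the question describes (the kernel name `staged_read` and the block size of 256 are my own assumptions, not from the question). Note that nothing in the copy line makes it atomic across the warp; coalescing is a performance optimization, not a consistency guarantee. The only correctness tool you have is the barrier.

```cuda
// Hypothetical example: the usual global-to-shared staging pattern.
__global__ void staged_read(const float* __restrict__ global_mem_data,
                            float* out, int n)
{
    __shared__ float shared_mem_data[256];  // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each thread issues its own independent load. The hardware may
        // coalesce these loads into one or more memory transactions, but
        // if another thread or kernel writes global_mem_data concurrently,
        // different lanes of the warp may observe old and new values mixed.
        shared_mem_data[threadIdx.x] = global_mem_data[i];
    }

    // Barrier: only after this point may a thread safely read elements
    // that were copied into shared memory by other threads of its block.
    __syncthreads();

    if (i < n) {
        out[i] = shared_mem_data[threadIdx.x];
    }
}
```

Each load here is atomic only at the granularity of its own properly aligned 4-byte word, as the comments above note; there is no warp-wide snapshot of global memory.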