I would like to ask about the performance impact of writing to global memory in CUDA. It is well known that global memory reads can have a large impact on performance (coalescing, cache behavior), since a read may take many cycles to return and can stall execution until the data arrives.
But what about writes to global memory? Is their performance sensitive to the access pattern in the same way? Or is the total cost simply the sum of the costs of all the writes in the kernel?
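To make the question concrete, here is a minimal sketch of the kind of comparison I have in mind. Each thread writes a single float, and a stride parameter controls whether the threads of a warp write consecutive addresses (stride 1) or scattered ones. The names `strided_write` and `time_strided_write` are made up for illustration, and the block size of 256 is an arbitrary choice.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Each thread writes one float. With stride == 1 the threads of a warp write
// consecutive addresses; with a larger stride the writes are spread out.
__global__ void strided_write(float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // size_t arithmetic avoids int overflow when n * stride is large.
        out[(size_t)i * stride] = 1.0f;
    }
}

// Hypothetical helper: launches the kernel once as a warm-up, then times a
// second launch with CUDA events. Returns the elapsed time in milliseconds,
// or a negative value if any CUDA call reported an error.
float time_strided_write(int n, int stride)
{
    float *d_out = nullptr;
    size_t bytes = (size_t)n * stride * sizeof(float);
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;                         // arbitrary block size
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n

    strided_write<<<blocks, threads>>>(d_out, n, stride);  // warm-up launch

    cudaEventRecord(start);
    strided_write<<<blocks, threads>>>(d_out, n, stride);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ok ? ms : -1.0f;
}
```

Comparing, say, `time_strided_write(1 << 20, 1)` against `time_strided_write(1 << 20, 32)` is the sort of measurement I am asking about: if writes were insensitive to the access pattern, I would expect the two timings to be roughly the same.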
Any related references and comments would be appreciated.