Writing to global memory in CUDA

Question

I would like to ask about the effect of writing to global memory in CUDA. It is well known that global memory reads often have a large impact on performance (coalescing, caches, bank conflicts), since a read can take many cycles to return and may stall execution while the data is in flight.

But what about writes to global memory in CUDA? Does performance depend on the memory write pattern? Or is the total cost simply the sum of all the writes in the kernel?

Any related references and comments would be appreciated.

This is exactly the kind of question that would be fun to explore using an experimental program. You could write a basic OpenCL or CUDA program that performs many millions of reads and writes in various patterns. Run your tests over and over in a loop, and see what you get on average. It's probably a good way to learn the boring parts of each API to boot. — James

harrism · Accepted Answer · 2012-02-03T01:47:18

In general, the answer to your question is "yes": stores are similar to loads, so the access pattern matters for writes too. The difference is that stores are "fire and forget": if there is work that does not depend on the stored data, the multiprocessor(s) can run it immediately after issuing the stores, and stalls happen only when read-after-write dependencies are encountered.
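To see the pattern dependence for yourself, here is a minimal sketch of the kind of experiment suggested in the comments. It is not taken from the programming guide: the kernel name `strided_write`, the helper `time_strided_write`, and the buffer sizes are all made up for illustration. Every launch issues the same number of stores; only the distance between the addresses written by neighbouring threads changes. Stride 1 is the fully coalesced case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread performs exactly one store.
// Neighbouring threads write `stride` floats apart, so stride == 1 is
// fully coalesced and larger strides spread a warp's stores over more
// memory segments.
__global__ void strided_write(float *out, size_t n, int stride)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t idx = tid * (size_t)stride;
    if (idx < n)
        out[idx] = 1.0f;
}

// Hypothetical helper: average time in milliseconds of one launch that
// issues `writes` stores with the given stride, or -1 on a CUDA error.
float time_strided_write(float *d_out, size_t n, size_t writes,
                         int stride, int reps)
{
    const int threads = 256;
    const int blocks = (int)((writes + threads - 1) / threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    strided_write<<<blocks, threads>>>(d_out, n, stride);  // warm-up launch

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        strided_write<<<blocks, threads>>>(d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (err == cudaSuccess) ? ms / reps : -1.0f;
}

int main()
{
    const size_t writes = 1 << 20;        // stores per launch (assumed size)
    const int max_stride = 32;
    const size_t n = writes * max_stride; // large enough for the widest stride

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    for (int stride = 1; stride <= max_stride; stride *= 2) {
        float ms = time_strided_write(d_out, n, writes, stride, 20);
        printf("stride %2d: %.4f ms per launch\n", stride, ms);
    }

    cudaFree(d_out);
    return 0;
}
```

Build it with something like `nvcc -O2 strided_write.cu -o strided_write` (the file name is arbitrary). If the cost were simply the sum of the writes, every stride would take about the same time, since the store count is identical; how much the timings actually diverge depends on the architecture, so treat the numbers as specific to your device.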

For full details, I suggest reading section 5.3.2 of the latest CUDA programming guide.

Also see appendix F of that document for specific information pertaining to different architecture families. For example, compute capability 1.x devices have more performance "cliffs" than compute capability 2.x (Fermi) devices.
