I understand that to make a CUDA program efficient, we need to launch enough threads to hide the latency of expensive operations such as global memory reads. For example, when a thread needs to read from global memory, other threads are scheduled to run so that the read operation overlaps with their execution. Therefore, the overall execution time of a CUDA program should just be the sum of the threads' compute time, not including the time spent on global memory reads. However, if we put the data into shared memory and let the threads read from shared memory instead, we can usually make the CUDA program run a lot faster. My confusion is that since the time for memory reads is hidden, it should not affect the program's performance. Why can it still impact the program's performance so much?
1 Answer
The very short answer is that the mere act of using shared memory won't, by itself, impart a performance improvement.
The act of reading from global memory into shared memory, then reading from shared memory - which is what is described in the question - has no beneficial effect on performance whatsoever and is a common misconception (mostly the fault of the programming guide, which says shared memory is faster than global, leading to the conclusion that using it is a silver bullet).
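As a hypothetical sketch (the kernel names and block size are illustrative, not from the question): a kernel that merely stages each element through shared memory still performs exactly one global-memory read per element, so it gains nothing over reading global memory directly.

    // Staging through shared memory with no reuse: one global read per
    // element either way, so this is no faster than the direct version.
    __global__ void copy_via_shared(const float *in, float *out, int n)
    {
        __shared__ float tile[256];                  // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];        // global -> shared
        __syncthreads();
        if (i < n) out[i] = tile[threadIdx.x];       // shared -> global
    }

    // Functionally identical and just as fast: read global memory directly.
    __global__ void copy_direct(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }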
The only ways shared memory can improve performance are by facilitating coalesced reads or writes to global memory (reducing memory transactions and improving cache behavior), by enabling data sharing or reuse between threads (saving memory bandwidth), or by serving as a faster scratch space than thread-local memory, which lives in DRAM.
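As a minimal sketch of the data-reuse case (the kernel name, tile size, and zero-padded boundary handling are illustrative assumptions): in a 3-point moving average, each input element is needed by up to three different threads, so loading a tile into shared memory once lets the block reuse it and cuts global-memory traffic roughly threefold.

    #define TILE 256                                  // assumed block size

    // 3-point moving average: each input element is reused by up to three
    // threads, so one shared-memory load per element replaces up to three
    // global-memory loads.
    __global__ void smooth3(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE + 2];              // +2 for halo elements
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                    // leave room for left halo

        tile[lid] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x == 0)                         // left halo (zero-padded)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)            // right halo (zero-padded)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        __syncthreads();

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }

Launched as smooth3<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n), this issues one global load per input element; the same computation reading global memory directly would issue up to three.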
[This answer assembled from comments and added as a community wiki entry to get the question off the unanswered list]