I understand that to make a CUDA program efficient, we need to launch enough threads to hide the latency of expensive operations such as global memory reads. For example, when a thread needs to read from global memory, other threads are scheduled to run so that the read operation overlaps with their execution. Therefore, the overall execution time of a CUDA program should just be the sum of the threads' compute time, not including the time spent on global memory reads. However, if we put the data into shared memory and let the threads read from shared memory instead, we can usually make the CUDA program run a lot faster. My confusion is that since the time for memory reads is hidden, it should not affect the program's performance. Why can it still impact the program's performance so much?
1 Answer
The very short answer is that the mere act of using shared memory won't, by itself, impart a performance improvement.
The act of reading from global memory into shared memory, then reading from shared memory - which is what is described in the question - has no beneficial effect on performance whatsoever and is a common misconception (mostly the fault of the programming guide, which says shared memory is faster than global, leading to the conclusion that using it is a silver bullet).
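As a hypothetical sketch (the kernel names and block size are illustrative, not from the question): a kernel that merely stages each element through shared memory still performs exactly one global-memory read per element, so it gains nothing over reading global memory directly.

    // Staging through shared memory with no reuse: one global read per
    // element either way, so this is no faster than the direct version.
    __global__ void copy_via_shared(const float *in, float *out, int n)
    {
        __shared__ float tile[256];                  // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];        // global -> shared
        __syncthreads();
        if (i < n) out[i] = tile[threadIdx.x];       // shared -> global
    }

    // Functionally identical and just as fast: read global memory directly.
    __global__ void copy_direct(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }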
The only ways shared memory can improve performance are by facilitating coalesced reads or writes to global memory (reducing memory transactions and improving cache behavior), by enabling data sharing or reuse between threads (saving memory bandwidth), or by serving as a faster scratch space than thread-local memory, which lives in DRAM.
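As a minimal sketch of the data-reuse case (the kernel name, tile size, and zero-padded boundary handling are illustrative assumptions): in a 3-point moving average, each input element is needed by up to three different threads, so loading a tile into shared memory once lets the block reuse it and cuts global-memory traffic roughly threefold.

    #define TILE 256                                  // assumed block size

    // 3-point moving average: each input element is reused by up to three
    // threads, so one shared-memory load per element replaces up to three
    // global-memory loads.
    __global__ void smooth3(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE + 2];              // +2 for halo elements
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                    // leave room for left halo

        tile[lid] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x == 0)                         // left halo (zero-padded)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)            // right halo (zero-padded)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        __syncthreads();

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }

Launched as smooth3<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n), this issues one global load per input element; the same computation reading global memory directly would issue up to three.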
[This answer assembled from comments and added as a community wiki entry to get the question off the unanswered list]