What is the difference between memory coalescing and bank conflicts when programming with CUDA?
Is it only that coalescing applies to global memory while bank conflicts apply to shared memory?
Should I worry about coalescing if I have a GPU with compute capability 1.2 or higher? Does it handle coalescing by itself?
3 Answers
Yes, coalesced reads/writes apply to global memory, and bank conflicts apply to shared memory reads/writes.
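For example (a minimal sketch; the kernel names and the 256-thread block size are my own assumptions), the first kernel's global loads coalesce because consecutive threads read consecutive addresses, while the second kernel's stride-2 shared memory indexing makes pairs of threads in a half-warp hit the same bank:

    // Global memory: consecutive threads read consecutive addresses,
    // so the hardware can merge a warp's loads into few transactions.
    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Shared memory: a stride-2 index means threads t and t+8 of a half-warp
    // address the same bank (on 16-bank hardware), a 2-way bank conflict.
    __global__ void bankConflictCopy(const float *in, float *out, int n)
    {
        __shared__ float tile[2 * 256];        // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[2 * threadIdx.x] = in[i];     // stride-2 write: bank conflicts
            out[i] = tile[2 * threadIdx.x];    // stride-2 read: bank conflicts
        }
    }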
Different compute capability devices behave differently here, but a compute capability 1.2 GPU still needs care to ensure that you're coalescing reads and writes; it's just that there are some optimisations to make things easier for you.
You should read the CUDA Best Practices guide. This goes into lots of detail about both these issues.
Yes: coalesced accesses are relevant to global memory only, and bank conflicts are relevant to shared memory only.
Also check out the Advanced CUDA C training session; the first section goes into some detail to explain how the hardware in compute capability 1.2 and higher GPUs helps you and what optimisations you still need to consider. It also explains shared memory bank conflicts. Check out this recording, for example.
The scan and reduction samples in the SDK also explain shared memory bank conflicts really well with progressive improvements to a kernel.
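As a rough sketch of the progression those samples walk through (the kernel names and the 256-thread block size are my own, not the SDK code verbatim): the interleaved-addressing version uses a strided index that causes bank conflicts, and the sequential-addressing version removes them by having consecutive threads touch consecutive shared memory words:

    #define BLOCK 256   // assumed block size; input length assumed a multiple of BLOCK

    // Interleaved addressing: the strided index causes shared memory
    // bank conflicts that get worse as the stride s grows.
    __global__ void reduceInterleaved(const float *in, float *out)
    {
        __shared__ float sdata[BLOCK];
        unsigned int tid = threadIdx.x;
        sdata[tid] = in[blockIdx.x * BLOCK + tid];
        __syncthreads();

        for (unsigned int s = 1; s < BLOCK; s *= 2) {
            int index = 2 * s * tid;           // strided index -> bank conflicts
            if (index < BLOCK)
                sdata[index] += sdata[index + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

    // Sequential addressing: consecutive threads access consecutive words,
    // which is conflict-free.
    __global__ void reduceSequential(const float *in, float *out)
    {
        __shared__ float sdata[BLOCK];
        unsigned int tid = threadIdx.x;
        sdata[tid] = in[blockIdx.x * BLOCK + tid];
        __syncthreads();

        for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];  // contiguous, conflict-free
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }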
A compute capability 1.2 or higher GPU will do the best it can with respect to coalescing, in that it can group memory accesses of the same size that fall within the same aligned memory segment (32, 64, or 128 bytes) and issue them as a single memory transaction. The GPU takes care of reordering accesses and aligning them to the right memory boundary. (On earlier GPUs, memory transactions within a half-warp had to be aligned to the segment and had to be in the right order.)
However, for optimal performance, you still need to make sure that those coalescing opportunities are available. If all threads within a warp access completely different memory segments, there is nothing the coalescer can do, so it still pays to be aware of the memory locality behavior of your kernel.
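For example (a hypothetical kernel of my own, just to illustrate the point): with a large per-thread stride, each thread of a warp reads from a different segment, so the loads cannot be merged and every access becomes its own transaction:

    __global__ void stridedRead(const float *in, float *out, int n, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = tid * stride;          // e.g. stride = 32 floats = 128 bytes
        if (i < n)
            out[tid] = in[i];          // reads scatter across separate segments
    }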