0
votes

I just started learning CUDA and there is something I can't quite understand yet. I was wondering whether there is a reason for splitting threads into blocks besides optimizing GPU workload. If there isn't, I can't understand why you would need to manually specify the number of blocks and their sizes. Wouldn't it be better to simply supply the number of threads needed to solve the task and let the GPU distribute the threads over the SMs?

That is, consider the following dummy task and GPU setup.

number of available SMs: 16
max number of blocks per SM: 8
max number of threads per block: 1024

Let's say we need to process every entry of a 256x256 matrix and we want a thread assigned to every entry, i.e. the overall number of threads is 256x256 = 65536. Then the number of blocks is:

overall number of threads / max number of threads per block = 65536 / 1024 = 64

Finally, 64 blocks will be distributed among 16 SMs, making it 64 / 16 = 4 blocks per SM, which is within the limit of 8. Now these are trivial calculations that the GPU could handle automatically, right?
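For reference, here is a minimal sketch of that arithmetic as plain host-side C++ (it would also compile inside a .cu file). The helper `ceil_div` and the constant names are hypothetical, not part of the CUDA API, and the even split across SMs is an assumption about how the blocks would ideally be spread, not a description of the real scheduler.

```cpp
#include <cassert>

// Hypothetical helper (not part of the CUDA API): division rounded up,
// so that a partially filled last block is still counted.
int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Numbers taken from the example setup above.
const int kSmCount = 16;
const int kMaxThreadsPerBlock = 1024;
const int kTotalThreads = 256 * 256;  // one thread per matrix entry

// 65536 threads / 1024 threads per block = 64 blocks
const int kBlocks = ceil_div(kTotalThreads, kMaxThreadsPerBlock);

// 64 blocks / 16 SMs = 4 blocks per SM, assuming an even split
const int kBlocksPerSm = ceil_div(kBlocks, kSmCount);
```

Rounding up matters as soon as the sizes stop dividing evenly: for 65537 threads, `ceil_div(65537, 1024)` gives 65 blocks, and the kernel would then need a bounds check for the unused threads in the last block.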

The only other reason I can think of for manually supplying the number of blocks and their sizes is separating threads in a specific fashion so that they can share local memory, i.e. somewhat isolating one block of threads from another.

But surely there must be another reason?

1
Threads within a block are guaranteed to run concurrently, so they can communicate (particularly via shared memory). So your suspicion is indeed correct. – tera

1 Answer

1
votes

I will try to answer your question from the point of view that I understand best.

The major factor that decides the number of threads per block is multiprocessor occupancy. The occupancy of a multiprocessor is the ratio of active warps to the maximum number of active warps it supports. The threads of a warp may be active or dormant for many reasons, depending on the application, so a fixed structure for the number of threads may not be viable.
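As a rough illustration of that ratio, here is a minimal host-side C++ sketch. The function `occupancy` and its parameters are hypothetical names, and the figure of 64 warps per multiprocessor used in the note below is an assumed limit that varies between GPU generations. The warp size of 32 threads is the value used by NVIDIA GPUs.

```cpp
// Hypothetical helper: occupancy = active warps / maximum warps per SM.
// Assumes the caller passes a block count that really fits on the SM.
double occupancy(int threads_per_block, int resident_blocks_per_sm,
                 int max_warps_per_sm) {
    const int warp_size = 32;  // threads per warp on NVIDIA GPUs
    // A partially filled warp still occupies a whole warp slot.
    const int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    const int active_warps = warps_per_block * resident_blocks_per_sm;
    return static_cast<double>(active_warps) / max_warps_per_sm;
}
```

With an assumed limit of 64 warps per multiprocessor, 4 resident blocks of 256 threads give 32 active warps and `occupancy(256, 4, 64)` returns 0.5, while 8 such blocks would fill the multiprocessor completely.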

Besides, each multiprocessor has a fixed number of registers shared among all the threads resident on it. If the registers needed by a single block exceed that number, the kernel launch fails; short of that, heavier register use per thread lowers the number of blocks that can be resident at once.

In addition, the fixed amount of shared memory available to a block can also constrain the number of threads when shared memory is heavily used.
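To make the interplay of these limits concrete, here is a simplified host-side C++ sketch of how many blocks could be resident on one multiprocessor. The structs `SmLimits` and `KernelUse` and the function `resident_blocks` are hypothetical, the numbers in the note below are illustrative assumptions, and the model ignores the allocation granularity that real hardware applies to registers and shared memory.

```cpp
#include <algorithm>

// Hypothetical per-SM limits; real values depend on the GPU generation.
struct SmLimits {
    int max_blocks;        // maximum resident blocks per SM
    int max_warps;         // maximum resident warps per SM
    int registers;         // registers available per SM
    int shared_mem_bytes;  // shared memory available per SM
};

// Hypothetical description of what one kernel launch needs.
struct KernelUse {
    int threads_per_block;
    int registers_per_thread;  // assumed to be greater than zero
    int shared_mem_per_block;  // bytes; 0 if shared memory is unused
};

// Resident blocks per SM = the tightest of the four limits.
// A result of 0 means a single block does not fit at all.
int resident_blocks(const SmLimits& sm, const KernelUse& k) {
    const int warp_size = 32;
    const int warps_per_block = (k.threads_per_block + warp_size - 1) / warp_size;
    const int by_warps = sm.max_warps / warps_per_block;
    const int by_regs = sm.registers / (k.registers_per_thread * k.threads_per_block);
    const int by_smem = k.shared_mem_per_block > 0
                            ? sm.shared_mem_bytes / k.shared_mem_per_block
                            : sm.max_blocks;
    return std::min({sm.max_blocks, by_warps, by_regs, by_smem});
}
```

Assuming a multiprocessor described by `SmLimits{8, 64, 65536, 49152}`, a kernel with `KernelUse{256, 64, 16384}` is limited to 8 blocks by warps, 4 by registers and 3 by shared memory, so 3 blocks are resident. With `KernelUse{1024, 128, 0}` a single block would need 131072 registers, the function returns 0, and on a real device such a launch would be expected to fail.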

Hence, a naive way to choose the number of threads is simply to use the occupancy calculator spreadsheet, if you want to remain oblivious to the type of application at hand. The better option is to consider occupancy together with the type of application being run.