
I have a Nvidia GeForce GTX 960M graphics card, which has the following specs:

  • Multiprocessors: 5
  • Cores per multiprocessor: 128 (i.e. 5 x 128 = 640 cores in total)
  • Max threads per multiprocessor: 2048
  • Max block size (x, y, z): (1024, 1024, 64)
  • Warp size: 32

If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will run concurrently only 128 threads at a time. However, if I run 5 blocks of 128 threads then each multiprocessor gets a block and all 640 threads are run concurrently. So, as long as I create blocks of 128 threads, the distribution of threads per multiprocessor will be as even as possible (assuming at least 640 threads in total).
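
For reference, the two launch configurations I'm comparing would look roughly like this (the kernel name and its per-thread work are just placeholders):

    #include <cuda_runtime.h>

    // Placeholder kernel with trivial per-thread work
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    void launchBoth(float *d_data, int n)   // n assumed to be 640 here
    {
        // Configuration A: 1 block of 640 threads
        myKernel<<<1, 640>>>(d_data, n);

        // Configuration B: 5 blocks of 128 threads
        myKernel<<<5, 128>>>(d_data, n);

        cudaDeviceSynchronize();
    }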

My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?


1 Answer


If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will run concurrently only 128 threads at a time.

That isn't correct. All 640 threads run concurrently. The SM has instruction latency and is pipelined, so that all threads are active and have state simultaneously. Threads are not tied to a specific core and the execution model is very different from a conventional multi-threaded CPU execution model.
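
One way to see this for yourself is to ask the runtime's occupancy API how many blocks of a given size can be resident on one SM at the same time. A minimal sketch with a placeholder kernel (the actual numbers depend on your kernel's register and shared memory usage):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel so the occupancy query has something to inspect
    __global__ void dummyKernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] += 1.0f;
    }

    int main()
    {
        int blockSize = 640;
        int blocksPerSM = 0;

        // Ask the runtime how many blocks of this size can be resident on one
        // SM at once (limited by registers, shared memory and the 2048-thread
        // per-SM cap, not by the 128 cores).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                      blockSize, 0);

        printf("Resident blocks of %d threads per SM: %d (%d resident threads)\n",
               blockSize, blocksPerSM, blocksPerSM * blockSize);
        return 0;
    }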

However, if I run 5 blocks of 128 threads then each multiprocessor gets a block and all 640 threads are run concurrently.

That may happen, but it is not guaranteed. All blocks will run, but which SM each block runs on is determined by the block scheduling mechanism, and those heuristics are not documented.
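
If you want to see where blocks actually end up, each block can read the %smid special register via inline PTX. This is only a sketch; the mapping it reports can differ from run to run:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each block records which SM it ended up on by reading %smid
    __global__ void whichSM(int *smOfBlock)
    {
        if (threadIdx.x == 0) {
            unsigned int smid;
            asm("mov.u32 %0, %%smid;" : "=r"(smid));
            smOfBlock[blockIdx.x] = (int)smid;
        }
    }

    int main()
    {
        const int numBlocks = 5;
        int *d_sm, h_sm[numBlocks];
        cudaMalloc(&d_sm, numBlocks * sizeof(int));

        whichSM<<<numBlocks, 128>>>(d_sm);
        cudaMemcpy(h_sm, d_sm, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

        for (int b = 0; b < numBlocks; ++b)
            printf("Block %d ran on SM %d\n", b, h_sm[b]);

        cudaFree(d_sm);
        return 0;
    }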

So, as long as I create blocks of 128 threads, the distribution of threads per multiprocessor will be as even as possible (assuming at least 640 threads in total).

From the points above, that does not follow either.

My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?

Because threads are not tied to cores: the architecture has a lot of latency and requires a large number of threads in flight to hide that latency and reach peak performance. Unfortunately, essentially none of the assumptions you make in your question are correct or relevant to determining the optimal number of blocks, or their size, for a given device.
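
As a practical starting point, the occupancy API can suggest a block size based on the kernel's actual resource usage rather than the core count. A minimal sketch with a placeholder kernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; its register and shared memory usage is what
    // actually determines the suggested configuration.
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    int main()
    {
        int minGridSize = 0;  // minimum grid size needed for full occupancy
        int blockSize   = 0;  // suggested block size

        // Let the runtime suggest a block size that maximizes occupancy for
        // this kernel on the current device, instead of guessing from cores.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

        printf("Suggested block size: %d, minimum grid size: %d\n",
               blockSize, minGridSize);
        return 0;
    }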