
I'm reading a text stating that, in CUDA, when multiplying two square matrices of the same dimensions using a single grid block with a maximum of 512 threads, the largest possible product matrix is 16x16, since a 32x32 result would require more than 512 threads (assuming each thread computes one element of the product matrix). I'm wondering why dimensions such as 17x17 or 22x22 aren't mentioned, since their product matrices don't exceed 512 elements either. Is this a memory alignment matter?
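For context, here is a minimal sketch of the kind of kernel the text seems to describe: one threadblock, one thread per output element (the kernel name and launch parameters are my own illustration, not from the text):

__global__ void matmul_one_block(const float *A, const float *B, float *C, int N)
{
    // One thread per element of the NxN product matrix C = A * B.
    int row = threadIdx.y;
    int col = threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Launch with a single block of NxN threads, e.g. N = 16 gives 256 threads.
// N = 32 would need 1024 threads, which exceeds 512 on older devices:
// matmul_one_block<<<1, dim3(N, N)>>>(d_A, d_B, d_C, N);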


1 Answer


There are many (unstated) assumptions in the text you are describing here.

512 threads per block is a limit of compute capability 1.x devices. Newer devices allow up to 1024 threads per block.

Another assumption is that each thread will be responsible for only one data element, i.e. one point in the output matrix. That limits you to 512 (or 1024) output points per threadblock. Many naive matrix multiply codes work this way, but it doesn't have to be this way. Nothing prevents you from writing a kernel in which a single threadblock handles a number of 16x16 sub-matrices in sequence, for example, as sketched below.
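As a rough sketch of that idea (my own illustration, not code from the text): a single 16x16 block can loop over all 16x16 tiles of a larger output matrix, with each thread computing one element per tile per iteration. For brevity this assumes N is a multiple of 16:

#define TILE 16

__global__ void matmul_tiled_single_block(const float *A, const float *B,
                                          float *C, int N)
{
    // One 16x16 threadblock walks every 16x16 tile of the NxN output,
    // so one block can cover matrices larger than 16x16.
    for (int tileRow = 0; tileRow < N / TILE; ++tileRow) {
        for (int tileCol = 0; tileCol < N / TILE; ++tileCol) {
            int row = tileRow * TILE + threadIdx.y;
            int col = tileCol * TILE + threadIdx.x;
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }
}

// Launch: matmul_tiled_single_block<<<1, dim3(TILE, TILE)>>>(d_A, d_B, d_C, N);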

Finally, the text assumes that you want to follow the general CUDA recommendation that a threadblock's thread count be an integer multiple of 32, the warp size. From here:

The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps as much as possible.

17x17 and 22x22 don't yield thread counts (289 and 484) that are integer multiples of 32.
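To make the waste concrete (my own arithmetic, not from the quoted guide): a block is scheduled in whole warps of 32 threads, so the hardware effectively rounds the block's thread count up to the next multiple of 32.

#include <cstdio>

// Round a block's thread count up to whole warps of 32 threads.
static int warps_for(int threads) { return (threads + 31) / 32; }

int main()
{
    // 16x16 = 256 threads -> 8 full warps, no idle lanes.
    // 17x17 = 289 threads -> 10 warps = 320 lanes, 31 of them idle.
    // 22x22 = 484 threads -> 16 warps = 512 lanes, 28 of them idle.
    int dims[] = {16, 17, 22};
    for (int n : dims) {
        int t = n * n;
        printf("%dx%d: %d threads, %d warps, %d idle lanes\n",
               n, n, t, warps_for(t), warps_for(t) * 32 - t);
    }
    return 0;
}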