CUDA thread and block organization directions

Question

In CUDA programming, threads and blocks have multiple directions (x, y and z).

Until now, I ignored this and only took into account the x direction (threadIdx.x, blockIdx.x, blockDim.x, etc.).

Apparently, both threads within a block and blocks on the grid are arranged as a cube. However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that? Only using the x direction, am I able to address all threads available to my GPU?

Robert Crovella · Accepted Answer · 2020-11-25T16:56:02

> Only using the x direction, am I able to address all threads available to my GPU?

If we are talking about a desire to spin up ~2 trillion threads or fewer, then there is no particular requirement to use a multidimensional block or grid. All CUDA GPUs of compute capability 3.0 and higher can launch up to about 2 billion blocks (2^31-1) with 1024 threads each, using a 1-D grid organization.
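As a rough check of the ~2 trillion figure, the sketch below multiplies the 1-D grid limit by the per-block thread limit, and shows one way these limits could be read from the device at run time. The helper names `max_threads_1d` and `print_1d_limits` are made up for illustration; `cudaGetDeviceProperties`, `maxGridSize` and `maxThreadsPerBlock` come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: largest thread count reachable with a 1-D grid,
// given the grid x-dimension limit and the threads-per-block limit.
constexpr long long max_threads_1d(long long max_blocks_x,
                                   long long max_threads_per_block) {
    return max_blocks_x * max_threads_per_block;
}

// (2^31 - 1) blocks * 1024 threads = 2,199,023,254,528 (about 2.2 trillion).
static_assert(max_threads_1d(2147483647LL, 1024LL) == 2199023254528LL,
              "1-D grid limit for compute capability 3.0 and higher");

// Hypothetical helper: print the same product using limits queried from device 0.
void print_1d_limits() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device found\n");
        return;
    }
    long long total = max_threads_1d(prop.maxGridSize[0], prop.maxThreadsPerBlock);
    std::printf("max blocks in x: %d, max threads per block: %d, product: %lld\n",
                prop.maxGridSize[0], prop.maxThreadsPerBlock, total);
}
```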

With techniques like a grid-stride loop, it seems rare to me that more than ~2 trillion threads would be needed.
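For reference, here is a minimal sketch of a grid-stride loop on a 1-D grid. The kernel `scale`, the launcher `launch_scale`, and the choice of 256 threads with a cap of 1024 blocks are assumptions for illustration only, not tuned values. The point is that a fixed-size grid can cover an array of any length, because each thread steps through the data in strides of the total thread count.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales n floats in place using a 1-D grid-stride loop.
__global__ void scale(float *data, size_t n, float factor) {
    // Global 1-D thread index and total number of threads in the grid.
    size_t start  = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    // Each thread handles elements start, start + stride, start + 2 * stride, ...
    for (size_t i = start; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical launcher: d_data must be a device (or managed) pointer.
void launch_scale(float *d_data, size_t n, float factor) {
    const int threads = 256;
    // Cap the block count; the loop in the kernel covers whatever remains.
    size_t needed = (n + threads - 1) / threads;
    int blocks = (int)(needed < 1024 ? needed : 1024);
    if (blocks == 0) return;  // nothing to do for an empty array
    scale<<<blocks, threads>>>(d_data, n, factor);
}
```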

I claim without formal proof that any problem that can be realized in a 1D grid can be realized in a 2D or 3D grid, and vice versa. This is just a mathematical mapping from one realization to another. Furthermore, it should be possible to arrange for important by-products like coalesced access in either realization.

There may be some readability benefits, code complexity benefits, and possibly small performance considerations when realizing in a 1D or multi-dimensional way. The usual case for this that I can think of is when the data to be processed is "inherently" multi-dimensional. In this case, letting the CUDA engine generate 2 or 3 distinct indices for you:

int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;

might be simpler than using a 1D grid index, and computing 2D data indices from those:

int tid = threadIdx.x+blockDim.x*blockIdx.x;
int idx = tid%DATA_WIDTH;
int idy = tid/DATA_WIDTH;

(The integer division operation above is unavoidable in the general case. The modulo operation can be simplified by using the result from the integer division, e.g. idx = tid - idy*DATA_WIDTH.)
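A sketch of that 1D-to-2D mapping with a single division might look like the following. The helper `flat_to_2d` and the kernel `fill_coords` are hypothetical names, and a row-major layout with a run-time `width` (instead of the `DATA_WIDTH` constant above) is assumed.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: split a non-negative flat index into (column, row) for a
// row-major array of the given width, using one division and no modulo.
__host__ __device__ inline void flat_to_2d(int tid, int width, int *idx, int *idy) {
    *idy = tid / width;           // row: the unavoidable integer division
    *idx = tid - (*idy) * width;  // column: same value as tid % width
}

// Hypothetical kernel: a 1-D grid covering a width x height row-major array.
// Each thread stores the 2D coordinates it derived from its 1-D index.
__global__ void fill_coords(int *out_x, int *out_y, int width, int height) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid >= width * height) return;  // guard threads past the end of the data
    int idx, idy;
    flat_to_2d(tid, width, &idx, &idy);
    out_x[tid] = idx;
    out_y[tid] = idy;
}
```

Because `flat_to_2d` is marked `__host__ __device__`, the same arithmetic can be checked on the host before it is used in a kernel.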

It's arguably an extra line of code and an extra division operation required to get to the same point, when only a 1D grid is created. However, I would suggest that even this is small potatoes, and you should use whichever approach seems most reasonable and comfortable to you as a programmer.

If for some reason you desire to spin up more than ~2 trillion threads, then moving to a multidimensional grid, at least, is unavoidable.

> Apparently, both threads within a block and blocks on the grid are arranged as a cube.

To understand how the threadblock thread index is computed in any case, I refer you to the programming guide. It should be evident that one case can be made equivalent to another - each thread gets a unique thread ID no matter how you specify the threadblock dimensions. In my opinion, a threadblock should only be thought of as a "cube" of threads (i.e. 3-dimensional) if you specify the configuration that way:

dim3 block(32,8,4);  //for example
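For reference, the programming guide defines the thread ID of the thread at index (x, y, z) in a block of size (Dx, Dy, Dz) as x + y*Dx + z*Dx*Dy. A minimal sketch of that formula follows; the helper `linear_thread_id` and the kernel `record_ids` are hypothetical names, and the kernel assumes a single-block launch with an output buffer of at least one element per thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: thread ID within a block, following the formula
// x + y*Dx + z*Dx*Dy from the CUDA programming guide.
__host__ __device__ inline unsigned int linear_thread_id(dim3 idx, dim3 dim) {
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Hypothetical kernel: each thread of a single block records its own linear ID.
// out must hold at least blockDim.x * blockDim.y * blockDim.z elements.
__global__ void record_ids(unsigned int *out) {
    unsigned int tid = linear_thread_id(
        dim3(threadIdx.x, threadIdx.y, threadIdx.z), blockDim);
    out[tid] = tid;
}
```

For the 32,8,4 block this gives IDs 0 through 1023, each used exactly once. For a 1024,1,1 block the same expression reduces to threadIdx.x.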

> However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that?

If you only used a single threadblock dimension to create a thread index in the 32,8,4 case:

int tid = threadIdx.x;

then you certainly would be "addressing" multiple threads (in y and z) using that approach: each value of threadIdx.x would be shared by 8*4 = 32 threads. That would typically, in my experience, be "broken" code. Therefore a kernel designed to use a multidimensional block or grid may not work correctly if the block or grid is specified as 1-dimensional, and the reverse statement is also true. You can find examples of such problems (thread index calculation not being correct for the grid design) here on the cuda tag.
