Suppose my input consists of seven data points on which some calculation is performed, with the results written back to an output array of size 7. Declaring the block dimension to be 4 results in a grid size of 2, i.e. 8 threads, so the last thread computes the out-of-range index 7 (using pt_id = blockIdx.x*blockDim.x + threadIdx.x) and the kernel fails with an invalid memory access (since I index some of my arrays by thread id). I could add code to my kernel that compares the thread id against a max_thread_id parameter and does nothing if thread_id > max_thread_id, but I am wondering whether there is a prettier way to handle ragged input arrays.
1 Answer
Having a task whose size is not a multiple of the block dimension is quite common. The solution I use most often is the following. Suppose your input data has size N and you want to launch with a block size of BLOCK_SIZE. In that case, your launch configuration could look like this:
kernel_function<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(...);
And in the kernel code, each thread determines whether it has any work to do, something like this:
int id = blockIdx.x*blockDim.x + threadIdx.x;
if (id < N) { /* do the stuff */ }
else { return; }
If the size of the task (N) depends on the input, you have to pass this value into the kernel function as a parameter as well. It is also quite common to define N and BLOCK_SIZE as macros or template parameters.
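To make this concrete, here is a minimal, self-contained sketch (not the original poster's code; the kernel name square_kernel, the squaring operation, and the choice to make BLOCK_SIZE a template parameter are illustrative assumptions) showing N passed as a kernel argument, the rounded-up grid size, and the in-kernel bounds check:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: squares each of the N input values.
// N is a runtime parameter; BLOCK_SIZE is a compile-time template
// parameter (it could just as well be a macro).
template <int BLOCK_SIZE>
__global__ void square_kernel(const float* in, float* out, int N)
{
    int id = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (id < N)                 // threads beyond N simply do nothing
        out[id] = in[id] * in[id];
}

int main()
{
    const int N = 7;            // ragged size, not a multiple of the block
    constexpr int BLOCK_SIZE = 4;

    float h_in[N] = {0, 1, 2, 3, 4, 5, 6}, h_out[N];
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // Round the grid size up so that all N elements are covered.
    int grid = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;   // = 2 for N = 7
    square_kernel<BLOCK_SIZE><<<grid, BLOCK_SIZE>>>(d_in, d_out, N);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%g ", h_out[i]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The extra threads in the last block fall through the if (id < N) check and exit immediately, so the rounded-up grid never reads or writes out of bounds.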
Finally, if your input array is small, as in your example, the GPU remains underutilized and the parallelism gains you nothing, or may even make your algorithm slower.
Comment: Pass the array size N to the kernel and check whether pt_id < N. I'm not sure what you want to achieve with max_thread_id. I also believe the title of your post is not appropriate: the overall number of threads can always be represented as dimGrid*dimBlock, which makes me think you are confusing array size with grid size. – Vitality