1
votes

Assuming a block has a limit of 512 threads, and my kernel needs more than 512 threads to execute, how should one design the thread hierarchy for optimal performance?

Case 1: 1st block - 512 threads, 2nd block - the remaining threads.

Case 2: distribute an equal number of threads across the blocks.

If a kernel needs some 600 threads, would the best option be to allocate 300 threads to each of 2 blocks, or is there an option to utilise all 512 threads of the 1st block and put the remainder in a 2nd block? - cuda-dev
I think it depends on the problem you are trying to solve. Could you be a little more specific? - KLee1
Also, if my kernel needs 601 threads, or any odd number like that, how should one allocate the blocks? - cuda-dev
@KLee1, this was a generic question. :) - cuda-dev

2 Answers

1
votes

I don't think it really matters; it is more important to group the thread blocks logically, so that you are able to use other CUDA optimizations (such as memory coalescing).

This link provides some insight into how CUDA will (likely) organize your threads.

A quote from the summary:

To summarize, special parameters at a kernel launch define the dimensions of a grid and its blocks. Unique coordinates in the blockIdx and threadIdx variables allow the threads of a grid to distinguish among themselves. It is the programmer's responsibility to use these variables in the kernel functions so that the threads can properly identify the portion of the data to process. These variables compel the programmers to organize threads and their data into hierarchical and multi-dimensional organizations.
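A minimal kernel sketch of what the quote describes (the names `scale` and `d_data` are hypothetical): each thread derives a unique global index from blockIdx and threadIdx and guards against running past the end of the data:

```cuda
// Each thread handles one element; threads past the end simply exit.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)              // guard: launched thread count may exceed n
        data[i] *= factor;
}

// Launch enough 512-thread blocks to cover n = 600 elements:
//   scale<<<(600 + 511) / 512, 512>>>(d_data, 600, 2.0f);
```

The guard is what makes an "odd" thread count such as 601 a non-issue: you launch a few spare threads and they do nothing.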

0
votes

It is preferable to divide the threads equally between the two blocks, in order to maximize the computation / memory-access overlap. When you have, for instance, 256 threads in a block, they do not all execute at the same time: they are scheduled on the SM in warps of 32 threads. When a warp is waiting on a global memory access, another warp is scheduled in its place. If you have a small block of threads, your global memory accesses are much more penalizing.

Furthermore, in your example you underuse your GPU. Remember that a GPU has dozens of multiprocessors (e.g. 30 for the Tesla C1060), and a block is mapped to a single multiprocessor. In your case, you will only use 2 of them.
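To make that last point concrete (a sketch under assumed names, `kernel` and `d_data` are placeholders): for 600 elements, many small blocks can spread the work over more multiprocessors than two large ones, at the cost of the per-block warp-overlap argument above.

```cuda
// Two blocks of 300 threads: at most 2 multiprocessors get work.
kernel<<<2, 300>>>(d_data, 600);

// Smaller blocks of 64 threads: (600 + 63) / 64 = 10 blocks,
// so up to 10 multiprocessors can run concurrently.
kernel<<<(600 + 63) / 64, 64>>>(d_data, 600);
```

Which configuration wins in practice depends on the kernel's memory behavior, so it is worth benchmarking both.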