1
votes

Assuming a block has a limit of 512 threads, and my kernel needs more than 512 threads to execute, how should one design the thread hierarchy for optimal performance?

Case 1: 1st block - 512 threads, 2nd block - the remaining threads.

Case 2: distribute an equal number of threads across the blocks.

If a kernel needs some 600 threads, would the best option be to allocate 300 threads to each of 2 blocks, or is there an option to utilise all 512 threads of the 1st block and put the remainder in a 2nd block? - cuda-dev
I think it depends on the problem you are trying to solve. Could you be a little more specific? - KLee1
Also, if my kernel needs 601 threads, or any odd number like that, how should one allocate the blocks? - cuda-dev
@KLee1, this was a generic question. :) - cuda-dev

2 Answers

1
votes

I don't think it really matters; it is more important to group the thread blocks logically, so that you are able to use other CUDA optimizations (such as memory coalescing).

This link provides some insight into how CUDA will (likely) organize your threads.

A quote from the summary:

To summarize, special parameters at a kernel launch define the dimensions of a grid and its blocks. Unique coordinates in the blockIdx and threadIdx variables allow the threads of a grid to distinguish among themselves. It is the programmer's responsibility to use these variables in the kernel functions so that the threads can properly identify the portion of the data to process. These variables compel the programmers to organize threads and their data into hierarchical and multi-dimensional organizations.
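A minimal kernel sketch of what the quote describes (the names `scale` and `d_data` are hypothetical): each thread derives a unique global index from blockIdx and threadIdx and guards against running past the end of the data:

```cuda
// Each thread handles one element; threads past the end simply exit.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)              // guard: launched thread count may exceed n
        data[i] *= factor;
}

// Launch enough 512-thread blocks to cover n = 600 elements:
//   scale<<<(600 + 511) / 512, 512>>>(d_data, 600, 2.0f);
```

The guard is what makes an "odd" thread count such as 601 a non-issue: you launch a few spare threads and they do nothing.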

0
votes

It is preferable to divide the threads equally between the two blocks, in order to maximize the computation / memory-access overlap. When you have, for instance, 256 threads in a block, they do not all execute at the same time: they are scheduled on the SM in warps of 32 threads. When a warp is waiting on a global memory access, another warp is scheduled in its place. If you have a small block of threads, your global memory accesses are much more penalizing.

Furthermore, in your example you underuse your GPU. Remember that a GPU has dozens of multiprocessors (e.g. 30 for the Tesla C1060), and a block is mapped to a single multiprocessor. In your case, you will only use 2 of them.
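To make that last point concrete (a sketch under assumed names, `kernel` and `d_data` are placeholders): for 600 elements, many small blocks can spread the work over more multiprocessors than two large ones, at the cost of the per-block warp-overlap argument above.

```cuda
// Two blocks of 300 threads: at most 2 multiprocessors get work.
kernel<<<2, 300>>>(d_data, 600);

// Smaller blocks of 64 threads: (600 + 63) / 64 = 10 blocks,
// so up to 10 multiprocessors can run concurrently.
kernel<<<(600 + 63) / 64, 64>>>(d_data, 600);
```

Which configuration wins in practice depends on the kernel's memory behavior, so it is worth benchmarking both.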