CUDA optimisation - kernel launch conditions

Question

I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario but I'll try to generalise it as much as possible so anyone else with a similar question can gain from this in the future.

Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made of a few repeating integers with each integer having a device function specific to it. For example, every time 5 appears in Array A, the kernel performs the function specific to 5. These functions are device functions.

I have parallelised this problem by launching 320 blocks (probably not the best number) so that each block performs the device function relevant to its element in parallel.

The CPU would handle the entire problem in a serial fashion where it will take element by element and call each function one after the other whereas the GPU would allocate an element to each block so that all 320 blocks can access the relevant device functions and calculate simultaneously.

In theory, for a large number of elements the GPU should be faster - at least I thought so, but in my case it isn't. My assumption is that since 300 elements is a small number, the CPU will always be faster than the GPU.

This is acceptable BUT what I want to know is how I can cut down the GPU execution time at least by a little. Currently, the CPU takes 2.5 milliseconds and the GPU around 12 ms.

Question 1 - How can I choose the optimum number of blocks/threads to launch at the start? First I tried 320 blocks with 1 thread per block. Then 1 block with 320 threads. No real change in execution time. Will tweaking the number of blocks/threads improve the speed?

Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?

Question 3 - What optimisation techniques should I look into?

Please let me know if any of this isn't that clear and I'll expand on it.

Thanks in advance.

@talonmies Thanks for the link. I've been working within the hard limits of my hardware. From what I understand, there is no simple answer to the block/thread number and it's more a trial and error thing? I've saved all my input elements in constant memory so that access time should be fast. Any other basic optimisation techniques for newbies I should look into? — user2550888
This sounds like a hugely divergent task. E.g. if you get different numbers in the same warp, the warp will essentially have to process the numbers sequentially. This will have a really negative impact on performance. One thing I would consider is having warps (or even kernels) dedicated to specific tasks (e.g. threads 0-warpSize would be devoted to processing elements with 5) and then trying to devise a scheme to assign the work to these threads. — Eugene
1. It's very unlikely that kernel launch configurations of either <<<1, 320>>> or <<<320, 1>>> could ever come close to fully utilizing the machine. 2. If you can sort your array A first, you will probably get better (much less divergent) GPU results. — Robert Crovella
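To make the two comments above concrete, here is a minimal sketch of sorting Array A before launching one thread per element, so that threads in the same warp mostly hit the same branch. The device functions `opFor3`, `opFor5` and `opDefault`, the kernel name `dispatch`, and the block size of 128 are all made up for illustration; the question does not show the real functions.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical per-value device functions (placeholders, not from the question).
__device__ float opFor3(float x)    { return x + 3.0f; }
__device__ float opFor5(float x)    { return x * 5.0f; }
__device__ float opDefault(float x) { return x; }

// One thread per element. The switch is where a warp diverges if its
// 32 threads see different values in a[].
__global__ void dispatch(const int* a, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = static_cast<float>(i);
    switch (a[i]) {
        case 3:  out[i] = opFor3(x);    break;
        case 5:  out[i] = opFor5(x);    break;
        default: out[i] = opDefault(x); break;
    }
}

int main()
{
    const int n = 300;

    // Build an interleaved input on the host: 3, 5, 3, 5, ...
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = (i % 2) ? 5 : 3;

    thrust::device_vector<int>   a(h.begin(), h.end());
    thrust::device_vector<float> out(n);

    // Sorting groups equal values together, so most warps take a single branch.
    // Note: out[] then follows the sorted order. If the original positions
    // matter, thrust::sort_by_key with an index array can carry them along.
    thrust::sort(a.begin(), a.end());

    const int threads = 128;                           // assumed block size
    const int blocks  = (n + threads - 1) / threads;   // round up
    dispatch<<<blocks, threads>>>(thrust::raw_pointer_cast(a.data()),
                                  thrust::raw_pointer_cast(out.data()), n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With only 300 elements the sort itself may cost more than it saves, so this is something to measure rather than assume.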

Eugene · Accepted Answer · 2013-07-18T02:03:32

Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block, the device will still execute a full warp of 32 for each block, with the other 31 lanes simply inactive. This is potentially an occupancy issue, though you may not observe it on your device and with your problem size. There is also a limit on the number of blocks a given multiprocessor (SM) can execute at once. AFAIR, GeForce 4xx can run up to 8 blocks on one SM. Hence, if you have a device with 8 SMs and a block size of 1, you can run only 64 threads simultaneously. You can use a tool called the occupancy calculator to estimate a better block size, or you can use the visual profiler.
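As a rough illustration of the block-size point, newer CUDA toolkits (6.5 and later, so after this answer was written) expose the occupancy calculation through the runtime API. The sketch below asks the runtime for a suggested block size for a placeholder kernel; the kernel `work` and its body are assumptions, not the asker's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-element work.
__global__ void work(const int* a, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<float>(a[i]) * 2.0f;
}

int main()
{
    const int n = 300;
    int minGridSize = 0;   // smallest grid that can reach full occupancy
    int blockSize   = 0;   // block size suggested by the runtime

    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, work, 0 /* dynamic smem */, 0 /* no limit */);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Grid size actually needed to cover n elements with that block size.
    const int gridSize = (n + blockSize - 1) / blockSize;

    // How many blocks of this size can be resident on one SM at once.
    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, work, blockSize, 0 /* dynamic smem */);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("suggested block size: %d, grid size for n=%d: %d\n",
                blockSize, n, gridSize);
    std::printf("min grid size for full occupancy: %d, resident blocks/SM: %d\n",
                minGridSize, blocksPerSM);
    return 0;
}
```

For 300 elements the resulting grid will be far smaller than the minimum grid size for full occupancy, which is another way of seeing that the problem is too small to fill the device.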
Whether the GPU wins, and at what problem size, can only be decided by profiling. There are too many unknowns, e.g. the ratio of memory accesses to actual computation, how parallelizable your task is, etc.
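Short of a full profiler run, a simple way to get numbers is to time the kernel with CUDA events at several problem sizes. The sketch below uses a trivial placeholder kernel (`work`), an assumed block size of 128, and arbitrary sizes from 300 to 300000; substitute the real kernel and a matching CPU timing to look for the crossover point. The warm-up launch is there because the first launch may include one-time initialisation cost.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; replace with the real per-element work.
__global__ void work(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 128;   // assumed block size

    // Arbitrary sizes: 300, 3000, 30000, 300000 elements.
    for (int n = 300; n <= 300000; n *= 10) {
        float* d = nullptr;
        if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
            std::printf("cudaMalloc failed for n=%d\n", n);
            return 1;
        }
        cudaMemset(d, 0, n * sizeof(float));

        const int blocks = (n + threads - 1) / threads;

        // Warm-up launch so one-time setup cost is not included in the timing.
        work<<<blocks, threads>>>(d, n);
        cudaDeviceSynchronize();

        // Timed launch: events are recorded in the same (default) stream.
        cudaEventRecord(start);
        work<<<blocks, threads>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("n=%d: kernel time %.3f ms\n", n, ms);

        cudaFree(d);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

This measures kernel time only. Host-to-device and device-to-host copies are excluded here and would need to be timed separately for a fair comparison with the CPU.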
I would really recommend that you start with the CUDA C Best Practices Guide.
