0
votes

I'm trying to accelerate a database search application with CUDA by running its core algorithm in parallel on the GPU.

In one test, I ran the algorithm in parallel across a digital sequence of size 5000 with 500 blocks per grid and 100 threads per block, and got a run time of roughly 500 ms.

Then I increased the size of the digital sequence to 8192, with 128 blocks per grid and 64 threads per block, and the algorithm somehow ran in 350 ms.

This suggests that the number of blocks and threads, and how they relate to each other, does affect performance.

My question is: how do I decide the number of blocks per grid and threads per block?

Below are my GPU specs from a standard device query program: (image not included)

1 Answer

2
votes

You should test it, because the best configuration depends on your particular kernel. One thing to aim for is making the number of threads per block a multiple of the warp size (32 threads on current NVIDIA GPUs). After that, you can aim for high occupancy on each SM, but high occupancy is not always synonymous with higher performance: it has been shown that lower occupancy can sometimes perform better. Memory-bound kernels usually benefit more from higher occupancy, because it helps hide memory latency; compute-bound kernels benefit less. Testing various configurations is your best bet.
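To make the warp-multiple advice concrete, here is a minimal host-side sketch of the arithmetic. It assumes a warp size of 32 and a kernel that maps one thread to one element, which the question does not actually state. `LaunchConfig`, `round_up_to_warp`, `make_launch_config`, and `total_threads` are hypothetical helper names, not part of the CUDA runtime. The code is plain C++ and should compile unchanged inside a `.cu` file.

```cpp
#include <cassert>

// Hypothetical helper type (not part of the CUDA runtime API).
struct LaunchConfig {
    int threads_per_block;
    int blocks_per_grid;
};

// Round a requested block size up to the next multiple of the warp size.
// warp_size defaults to 32, an assumption that holds on current NVIDIA GPUs.
inline int round_up_to_warp(int requested_threads, int warp_size = 32) {
    assert(requested_threads > 0 && warp_size > 0);
    return ((requested_threads + warp_size - 1) / warp_size) * warp_size;
}

// Derive a launch configuration for n elements.
// Assumption: one thread per element; adapt this if your kernel
// processes several elements per thread.
inline LaunchConfig make_launch_config(int n, int requested_threads,
                                       int warp_size = 32) {
    assert(n > 0);
    LaunchConfig cfg;
    cfg.threads_per_block = round_up_to_warp(requested_threads, warp_size);
    // Ceiling division so that the last, partially filled block is included.
    cfg.blocks_per_grid =
        (n + cfg.threads_per_block - 1) / cfg.threads_per_block;
    return cfg;
}

// Total number of threads a configuration launches.
inline int total_threads(const LaunchConfig& cfg) {
    return cfg.threads_per_block * cfg.blocks_per_grid;
}
```

Under those assumptions, `make_launch_config(5000, 100)` gives 128 threads per block and 40 blocks (5120 threads), while `make_launch_config(8192, 64)` gives 64 threads per block and 128 blocks, which matches the second test in the question. Because the grid can overshoot the data size, the kernel would need a bounds check such as `if (idx < n)`. This only produces a sensible starting point; you would still time several block sizes (for example 64, 128, 256, 512), and the CUDA runtime's `cudaOccupancyMaxPotentialBlockSize` can suggest another candidate to include in that sweep.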