I'm trying to accelerate this database search application with CUDA, and I'm working on running a core algorithm in parallel with CUDA.
In one test, I run the algorithm in parallel across a digital sequence of size 5000 with 500 blocks per grid and 100 threads per block and came back with a runt time of roughly 500 ms.
Then I increased the size of the digital sequence to 8192 with 128 blocks per grid and 64 threads per block and somehow came back with a result of 350 ms to run the algorithm.
This would indicate that how many blocks and threads used and how they're related does impact performance.
My question is how to decide the number of blocks/grid and threads/block?
Below I have my GPU specs from a standard device query program: