I have implemented a string matching algorithm on the GPU. The searching time of a parallel version has been decreased considerably compared with the sequential version of the algorithm, but by using different number of blocks and threads I get different results. How can I determine the number of the blocks and threds to get the best results?
2 Answers
I think this question is hard, if not impossible, to answer for the reason that it really depends on the algorithm and how it is operating. Since i cant see your implementation i can give you some leads:
Don't use global memory & check how you can max out the use of shared memory. Generally get a good feel of how threads access memory and how data is retrieved etc.
Understand how your warps operate. Sometimes threads in a warp may wait for other threads to finish in case you have 1 to 1 mapping between thread and data. So instead of this 1 to 1 mapping, you can map threads to multiple data so that they are kept busy.
Since blocks consist of threads that are group in 32 threads warp, it is the best if the number of threads in a block is a multiple of 32, so that you dont get warps consisting of 3 threads etc.
Avoid Diverging paths in warps.
I hope it helps a bit.
@Chris points are very important too but depend more on the algorithm itself.
Check the Cuda Manual about Thread alignment regarding memory lookups. Shared Memory Arrays should also be size of multiple of 16.
Use Coalesced global memory reads. But by algorithm design this is often the case and using shared memory helps.
Don't use atomic operations in global memory or at all if possible. They are very slow. Some algorithms using atomic operations can be rewritten using different techniques.
Without shown code no-one can tell you what is the best or why performance changes.
The number of threads per block of your kernel is the most important value.
Important values to calculate that value are:
- Maximum number of resident threads per multiprocessor
- Maximum number of resident blocks per multiprocessor
- Maximum number of threads per block
- Number of 32-bit registers per multiprocessor
Your algorithms should be scalable across all GPU's reaching 100% occupancy. For this I created myself a helper class which automatically detects the best thread numbers for the used GPU and passes it to the Kernel as a DEFINE.
/**
* Number of Threads in a Block
*
* Maximum number of resident blocks per multiprocessor : 8
*
* ///////////////////
* Compute capability:
* ///////////////////
*
* Cuda [1.0 - 1.1] =
* Maximum number of resident threads per multiprocessor 768
* Optimal Usage: 768 / 8 = 96
* Cuda [1.2 - 1.3] =
* Maximum number of resident threads per multiprocessor 1024
* Optimal Usage: 1024 / 8 = 128
* Cuda [2.x] =
* Maximum number of resident threads per multiprocessor 1536
* Optimal Usage: 1536 / 8 = 192
*/
public static int BLOCK_SIZE_DEF = 96;
Example Cuda 1.1 to reach 786 resident Threads per SM
- 8 Blocks * 96 Threads per Block = 786 threads
- 3 Blocks * 256 Threads per Block = 786 threads
- 1 Blocks * 512 Threads per Block = 512 threads <- 33% of GPU will be idle
This is also mentioned in the book:
Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)
Good programming advices:
- Analyse your kernel code and write down the maximal number of threads it can handle or how many "units" it can process.
- Also output your register usage and try to lower it to the respective targeted CUDA version. Because if you use too many registers in your kernel less blocks will be executed resulting in less occupancy and performance.
Example: Using Cuda 1.1 and using optimal number of 768 resident threads per SM you have 8192 registers to use. This leads to 8192 / 768 = 10 maximum registers per thread/kernel. If you use 11 the GPU will use 1 Block less resulting in decreased performance.
Example: A matrix independent row vector normalizing kernel of mine.
/*
* ////////////////////////
* // Compute capability //
* ////////////////////////
*
* Used 12 registers, 540+16 bytes smem, 36 bytes cmem[1]
* Used 10 registers, 540+16 bytes smem, 36 bytes cmem[1] <-- with -maxregcount 10 Limit for Cuda 1.1
* I: Maximum number of Rows = max(x-dim)^max(dimGrid)
* II: Maximum number of Columns = unlimited, since they are loaded in a tile loop
*
* Cuda [1.0 - 1.3]:
* I: 65535^2 = 4.294.836.225
*
* Cuda [2.0]:
* II: 65535^3 = 281.462.092.005.375
*/