
I'm working on a program to convert an image to grayscale, using the CImg library. For each pixel I have to read the three R, G, B values, compute the corresponding gray value, and store the gray pixel in the output image. I'm working with an NVIDIA GTX 480. Some details about the card:

  • Microarchitecture: Fermi
  • Compute capability (version): 2.0
  • Cores per SM: 32 (warp size: 32)
  • Streaming Multiprocessors: 15
  • Maximum number of resident warps per multiprocessor: 48
  • Maximum amount of shared memory per multiprocessor: 48KB
  • Maximum number of resident threads per multiprocessor: 1536
  • Number of 32-bit registers per multiprocessor: 32K

I'm using a square grid with blocks of 256 threads. This program can take input images of different sizes (e.g. 512x512 px, 10000x10000 px). I observed that increasing the number of pixels assigned to each thread improves performance, so it's better than computing one pixel per thread. The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number? I know that on the GTX 480, 1536 is the maximum number of resident threads per multiprocessor. Do I have to take this number into account? The following is the code executed by the kernel.

// Grid-stride loop: each thread starts at its global index and advances by the
// total number of threads in the grid, so a fixed-size grid covers any image size.
for (i = (blockIdx.x * blockDim.x) + threadIdx.x; i < width * height; i += gridDim.x * blockDim.x) {
    float grayPix = 0.0f;
    // The input image is stored as three planes: all R values, then all G, then all B.
    float r = static_cast< float >(inputImage[i]);
    float g = static_cast< float >(inputImage[(width * height) + i]);
    float b = static_cast< float >(inputImage[(2 * width * height) + i]);

    grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[i] = static_cast< unsigned char >(grayPix);
}
The performance depends on many factors: register usage, memory coalescing, and of course the block and grid size. You might gain some information by entering your numbers into the "NVIDIA CUDA Occupancy Calculator" - this is an XLS (Excel) file available at developer.download.nvidia.com/compute/cuda/… – Marco13
@Marco: It isn't even necessary to use the occupancy spreadsheet any more. The runtime API features cudaOccupancyMaxActiveBlocksPerMultiprocessor, which will do all the hard work for you. – talonmies
@talonmies Right, that was added in ... CUDA 6 or so? However, I think the spreadsheet can still be handier for a quick check than writing custom code, consulting the API docs and trying out different parametrizations in some modify-compile-run-repeat cycle. I just wanted to mention it (although I haven't used it extensively myself, and can't say how helpful it really is for a targeted optimization). – Marco13
@Marco: I've already tried the "NVIDIA CUDA Occupancy Calculator". But the only thing that I can see is the impact of varying 1) register count per thread, 2) shared memory per block and 3) block size. There are no suggestions regarding the best number of blocks to use to gain performance. Am I wrong? – Stefano Sandonà
I'm not so familiar with it either, so I think @talonmies may give more focused advice here. But for example, when you open the XLS with default settings and then change the "Threads per Block" to 128, you will see that the occupancy (in the upper chart, and in the field "Occupancy of each Multiprocessor") decreases. The maxima seem to be achieved e.g. for 512 or 672 threads (100% or 98% occupancy). – Marco13
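
For completeness, here is a minimal sketch of the occupancy query talonmies mentions above; the kernel name myKernel, the block size of 256, and device 0 are placeholders chosen for illustration, not taken from the question:

#include <cstdio>

__global__ void myKernel(const unsigned char *in, unsigned char *out, int width, int height)
{
    // kernel body omitted
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // device 0 assumed

    int blockSize = 256;                                // threads per block (assumed)
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel, blockSize, 0);

    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %d of %d resident warps per SM (%.0f%%)\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}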

1 Answer


The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number?

Setting the code aside for a moment, you've mentioned an observed characteristic:

I observed that increasing the number of pixels assigned to each thread improves performance,

This is actually a fairly common observation for these types of workloads, and it may also be the case that this is more evident on Fermi than on newer architectures. A similar observation occurs during matrix transpose. If you write a "naive" matrix transpose that transposes one element per thread, and compare it with the matrix transpose discussed here that transposes multiple elements per thread, you will discover, especially on Fermi, that the multiple element per thread transpose can achieve approximately the available memory bandwidth on the device, whereas the one-element-per-thread transpose cannot. This ultimately has to do with the ability of the machine to hide latency, and the ability of your code to expose enough work to allow the machine to hide latency. Understanding the underlying behavior is somewhat involved, but fortunately, the optimization objective is fairly simple.
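
As a reference point, here is a sketch of that multiple-elements-per-thread transpose; TILE_DIM = 32 and BLOCK_ROWS = 8 are the commonly used values, and the matrix is assumed to be square with a width that is a multiple of TILE_DIM:

#define TILE_DIM   32   // each block handles one TILE_DIM x TILE_DIM tile
#define BLOCK_ROWS 8    // each thread copies TILE_DIM / BLOCK_ROWS = 4 elements

__global__ void transposeTiled(float *odata, const float *idata, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column avoids shared memory bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Each thread loads 4 elements of the tile, striding down by BLOCK_ROWS rows.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    // Write the tile back transposed; swapping the block indices keeps the stores coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}

Launched with dim3 block(TILE_DIM, BLOCK_ROWS) and dim3 grid(width/TILE_DIM, width/TILE_DIM), each thread moves four elements instead of one, which gives the scheduler several independent memory operations per thread to overlap.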

GPUs hide latency by having lots of available work to switch to, when they are waiting on previously issued operations to complete. So if I have a lot of memory traffic, the individual requests to memory have a long latency associated with them. If I have other work that the machine can do while it is waiting for the memory traffic to return data (even if that work generates more memory traffic), then the machine can use that work to keep itself busy and hide latency.

The way to give the machine lots of work starts by making sure that we have enabled the maximum number of warps that can fit within the machine's instantaneous capacity. This number is fairly simple to compute: it is the product of the number of SMs on your GPU and the maximum number of warps that can be resident on each SM. We want to launch a kernel that meets or exceeds this number, but additional warps/blocks beyond this number don't necessarily help us hide latency.
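
As a concrete sketch for the card in the question (the device index 0 is an assumption), that target can be computed at runtime:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

// Maximum resident warps per SM = maximum resident threads per SM / warp size.
int warpsPerSM    = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // 1536 / 32 = 48 on a GTX 480
int targetWarps   = prop.multiProcessorCount * warpsPerSM;             // 15 * 48 = 720
int targetThreads = targetWarps * prop.warpSize;                       // 720 * 32 = 23040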

Once we have met the above number, we want to pack as much "work" as possible into each thread. Effectively, for the problem you describe and the matrix transpose case, packing as much work into each thread means handling multiple elements per thread.

So the steps are fairly simple:

  1. Launch as many warps as the machine can handle instantaneously
  2. Put all remaining work in the thread code, if possible (a sketch combining both steps follows this list).
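
Put together for the grayscale case, such a sketch might look like the following; the kernel mirrors the code in the question, while the host-side names and the choice of 256 threads per block are illustrative rather than the only valid configuration:

// Each thread processes many pixels via a grid-stride loop.
__global__ void grayscaleKernel(const unsigned char *inputImage,
                                unsigned char *darkGrayImage,
                                int width, int height)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < width * height; i += stride) {
        float r = static_cast< float >(inputImage[i]);
        float g = static_cast< float >(inputImage[(width * height) + i]);
        float b = static_cast< float >(inputImage[(2 * width * height) + i]);
        float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
        darkGrayImage[i] = static_cast< unsigned char >((grayPix * 0.6f) + 0.5f);
    }
}

// Host side: size the grid to the machine, not to the image.
void launchGrayscale(const unsigned char *d_in, unsigned char *d_out, int width, int height)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                                     // device 0 assumed

    int blockSize = 256;                                                   // as in the question
    int maxResidentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    int numBlocks = maxResidentThreads / blockSize;                        // step 1: fill the machine
    grayscaleKernel<<<numBlocks, blockSize>>>(d_in, d_out, width, height); // step 2: the loop absorbs the rest
}

On the GTX 480 this works out to 15 * 1536 / 256 = 90 blocks (23040 threads); any pixels beyond that are absorbed by the grid-stride loop rather than by launching more blocks.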

Let's take a simplistic example. Suppose my GPU has 2 SMs, each of which can handle 4 warps (128 threads). Note that this is not the number of cores, but the "Maximum number of resident warps per multiprocessor" as indicated by the deviceQuery output.

My objective then is to create a grid of 8 warps, i.e. 256 threads total (in at least 2 threadblocks, so they can distribute to each of the 2 SMs) and make those warps perform the entire problem by handling multiple elements per thread. So if my overall problem space is a total of 1024x1024 elements, I would ideally want to handle 1024*1024/256 = 4096 elements per thread.

Note that this method gives us an optimization direction. We do not necessarily have to achieve this objective completely in order to saturate the machine. It might be the case that it is only necessary, for example, to handle 8 elements per thread, in order to allow the machine to fully hide latency, and usually another limiting factor will appear, as discussed below.

Following this method will tend to remove latency as a limiting factor for performance of your kernel. Using the profiler, you can assess the extent to which latency is a limiting factor in a number of ways, but a fairly simple one is to capture the sm_efficiency metric, and perhaps compare that metric in the two cases you have outlined (one element per thread, multiple elements per thread). I suspect you will find, for your code, that the sm_efficiency metric indicates a higher efficiency in the multiple elements per thread case, and this is indicating that latency is less of a limiting factor in that case.
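
With nvprof, for example, capturing that metric can be as simple as the following (the executable name is a placeholder):

nvprof --metrics sm_efficiency ./grayscale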

Once you remove latency as a limiting factor, you will tend to run into one of the other two machine limiting factors for performance: compute throughput and memory throughput (bandwidth). In the matrix transpose case, once we have sufficiently dealt with the latency issue, then the kernel tends to run at a speed limited by memory bandwidth.