
Suppose I have N tasks, where each task can be performed by a single thread on the GPU. Suppose also that N = the number of threads on the GPU.

Question 1: Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?

cudaDeviceProp theProps;
cudaGetDeviceProperties(&theProps, 0);  // populate the properties for device 0 before use

dim3 mygrid(theProps.maxGridSize[0], 1, 1);
dim3 myblocks(theProps.maxThreadsDim[0], 1, 1);

mykernel<<<mygrid, myblocks>>>(...);

Question 2: What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?

Question 3: Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?

Question 4: Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?

1 Answer


It's recommended to ask one question per question. Having all sorts of questions makes it very difficult for anyone to give a complete answer. SO isn't really a tutorial service. You should avail yourself of the existing documentation, webinars, and of course there are many other resources available.

Is the following an appropriate way to launch a 1D kernel of maximum size? Will all N threads that exist on the GPU perform the work?

It's certainly possible. All of the threads launched (call that total N) will be available to perform work, and it will launch a grid of maximum (1D) size. But why do you want to do that anyway? Most CUDA programming methodologies don't start out with that as a goal; the grid should be sized to the algorithm. If the 1D grid size appears to be a limiter, you can work around it by performing loops in the kernel to handle multiple data elements per thread, or else launch a 2D grid to get around the 1D grid limit. The limit for cc3.x devices has been expanded.
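The loop approach mentioned above is commonly called a grid-stride loop. As a sketch (the kernel name, array, and element-wise work are illustrative, not from the question), it lets a fixed-size grid cover any number of elements:

```cuda
// Hypothetical example: a grid-stride loop. Each thread starts at its
// global index and advances by the total number of threads in the grid,
// so the grid size need not match the data size N.
__global__ void mykernel(float *d_data, size_t n)
{
    size_t idx    = blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;  // total threads in the grid
    for (size_t i = idx; i < n; i += stride)
        d_data[i] *= 2.0f;                           // per-element work goes here
}
```

With this pattern the launch configuration becomes a tuning parameter rather than a correctness requirement.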

What is the property cudaDeviceProp::maxThreadsPerBlock in relation to cudaDeviceProp::maxThreadsDim[0] ? How do they differ? Can cudaDeviceProp::maxThreadsPerBlock be substituted for cudaDeviceProp::maxThreadsDim[0] above?

The first is a limit on the total number of threads in a multidimensional block (i.e. threads_x*threads_y*threads_z). The second is a limit on the size of the first (x) dimension alone. For a 1D threadblock they are interchangeable, since the y and z dimensions are 1. For a multidimensional block, both limits apply together: a threadblock of, for example, maxThreadsDim[0] x maxThreadsDim[1] x maxThreadsDim[2] threads would satisfy each per-dimension limit but exceed maxThreadsPerBlock, and so is not legal.
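A quick host-side sketch to inspect both limits on your own device (device 0 assumed):

```cuda
#include <cstdio>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);  // query device 0
    printf("maxThreadsPerBlock: %d\n", p.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n",
           p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
    return 0;
}
```

On many devices maxThreadsPerBlock is 1024 while maxThreadsDim reports something like 1024 x 1024 x 64, which makes the point above concrete: no block can use all three per-dimension maxima at once.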

Suppose that I want to divide the shared memory of a block equally amongst the threads in the block, and that I want the most amount of shared memory available for each thread. Then I should maximize the number of blocks, and minimize the number of threads per block, right?

Again, I'm a bit skeptical of the methodology. But yes, the theoretical maximum of shared memory bytes per thread would be achieved by a threadblock with the smallest number of threads. However, allowing a single threadblock to use all the available shared memory may result in only one threadblock being resident on an SM at a time. This may have negative consequences for occupancy, which may have negative consequences for performance. There are many useful recommendations for choosing threadblock size to maximize performance; I can't summarize them all here. But we want to choose threadblock size as a multiple of warp size, we generally want multiple warps per threadblock, and, all other things being equal, we want to enable maximum occupancy (which is related to the number of threadblocks that can be resident on an SM).
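To illustrate the even-division idea with dynamically allocated shared memory (a sketch; the kernel name and byte counts are illustrative assumptions):

```cuda
// Illustrative only: request bytesPerThread * blockDim.x bytes of dynamic
// shared memory at launch. Fewer threads per block means more bytes per
// thread, up to the device's per-block shared memory limit.
__global__ void mykernel(int bytesPerThread)
{
    extern __shared__ char smem[];           // sized by the launch's 3rd parameter
    // each thread's private slice of the block's shared memory
    char *mySlice = smem + threadIdx.x * bytesPerThread;
    (void)mySlice;                           // work with mySlice here
}

// launch sketch: 64 threads per block, 768 bytes each -> 48KB per block
// mykernel<<<grid, 64, 64 * 768>>>(768);
```

Note the trade-off described above: a launch that consumes the full per-block shared memory budget can limit how many blocks fit on an SM concurrently.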

Just to confirm (after reviewing related questions on SO), in the linear (1D) grid/block scheme above, a global unique thread index is unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x. Right?

Yes, for a 1-D threadblock and grid structure, this line will give a globally unique thread ID:

unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
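One caveat worth adding (a sketch, not from the question): in the common case where N is not an exact multiple of the block size, the grid is rounded up and the kernel should guard against out-of-range indices:

```cuda
// Hypothetical kernel: the guard protects against the extra threads that
// exist when the grid is rounded up to a whole number of blocks.
__global__ void mykernel(float *d_data, unsigned int n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)                 // grid may contain more threads than elements
        d_data[tid] += 1.0f;
}

// host side: round the grid size up so every element is covered
// unsigned int threads = 256;
// unsigned int blocks  = (n + threads - 1) / threads;
// mykernel<<<blocks, threads>>>(d_data, n);
```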