4
votes

Let me start by saying that I've read carefully all similar questions on SO:

  1. Determining threads per block and block per grid
  2. Threads per SM, threads per block
  3. CUDA Blocks and Threads
  4. Warps and optimal number of blocks

My intention is to try and calculate dynamically (rather than hardcoding values) for a feed-forward neural net library I am developing.

My data is not a square lattice (a matrix) as is often with most examples I've seen, it is instead two vectors producing a matrix, with unequal rows to columns:

float x[6] {1.f, 1.f, 0.f, 1.f, 1.f, 0.f}; 
thrust::device_vector<float> in_vec( x, x+6 );
float y[9] {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f};
thrust::device_vector<float> w_vec( y, y+9 );
thrust::device_vector<float> o_wec(9);
thrust::device_vector<float> mtx_vec( 9 * 6 );

float * i_ptr = thrust::raw_pointer_cast( in_vec.data() );
float * w_ptr = thrust::raw_pointer_cast( w_vec.data() );
float * out_ptr = thrust::raw_pointer_cast( mtx_vec.data() );

dim3 threadsPerBlock(9,6);
dim3 numBlocks(1,1);
prop_mtx<<<numBlocks,threadsPerBlock>>>( w_ptr, i_ptr, out_ptr, 6 );

and the kernel:

__global__ void prop_mtx( float * w, float * i, float * o, int s ) 
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; 
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    o[y + x * s] = w[x] * i[y];
}

The reason why I've taken this approach is because it makes sense in ANN computation, when it comes to vector/matrix calculations. I'd like to keep this consistent, and AFAIK using a 2D grid for Weight * Input calculations is reasonable.

I have to compute my threads per block as a 2D with unequal numbers of threads in the grid.

I am ussing a GTX 660, which has:

  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2047 MBytes 
  ( 5) Multiprocessors, (192) CUDA Cores/MP:     960 CUDA Cores
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)

I am trying to understand how I can deduce/compute the grid size, threads per block, and number of blocks.

Let us assume I have a weight vector of 800 items, and an input vector of 6500 items.

  1. Does this imply that what I really need, is a 2D grid of 800,6500? As far as I understand, anything else will provide incorrect results?

I know my maximum threads per block is 1024, but because its a 2D grid, it would more likely be:

dim3 threadPerBlock(X,Y);
  1. Due to the fact that my grid is not a square matrix, I need to calculate the X, Y threads per block in a different way?

  2. Or I need to deduce the number of blocks needed first?

Finally, since my thread warp size is 32,

  1. Does the minimum grid size, regardless of all other parameters need to be at least 32, or a multiple of 32? Do I need at least 32 threads per block, or a grid size where the smallest number is 32?

Any pseudo-code, or explanation of how I should go about this, would be greatly appreciated.

What I have tried, is to calculate my 2D grid size, by dividing my data by 32 wrap size. Then I considered calculating the grid threads by using the available SMs. For example

800 weights / 5 SM, = 160 x's per SM
6500 inputs  / 5 SM, = 1300 y's per SM

But I didn't know what to do from there on. Finally, I considered finding the input-weight ratio first:

6500/800 = 8.125

Implying that using the 32 minimum grid size for X, Y would have to be multiplied by 8.125 * 32 Hence, my threadsPerBlock would be:

dim3 threadsPerBlock(32,260);

That is of course, 8320 threads per block, which far exceeds the 1024 per block.

So this is my issue: how do I not exceed the 1024 threads per block, whilst retaining the correct grid size of my data?

PS: My question is not about optimising the code, but understanding how to distribute the threads and grid data over the device.

1
@talonmies whilst your reply there is very helpful, it doesn't answer all of my questions: how do I deduce number of threads (total?) so that the grid is aligned to the data, or do I not need to align it? One of the answers has the following: gridSize = (N + blockSize - 1) / blockSize; Do I just need to calculate the threads per block regardless of the grid X,Y?Ælex
Yes, pick an arbitrary block size, like 32x32. Then divide your total x-grid-width (800) by the x-block-dimension (32) and launch that many blocks (plus one) in the x direction of your grid. Then divide your total y-grid-width (6500) by the y-block-dimension (32) and launch that many blocks (plus one) in the y direction of your grid. The number of SM's in your GPU don't factor into this. And I would assume that just like in your trivial case you needed 9x6 threads, in the larger case you will need 800x6500 threads total. This methodology is covered in numerous places.Robert Crovella
Your kernel will also need a "thread check" to prevent threads outside of the desired 800x6500 region from doing anything. There is no "32 minimum grid size". The recommendation is to make your threadblock size a whole-number multiple of 32. 32x32 satisfies that.Robert Crovella
"thread check" is just what I call it. If you want to see other examples, try putting "user:1695960 thread check" in the search box in the upper right hand corner of this page.Robert Crovella

1 Answers

7
votes

One approach to categorizing computation problems is to discuss transformations and reductions.

A reduction is a category of problem which takes a large input data set size, and produces a small output data set size. For example, taking an image and finding the maximum pixel value would be a reduction. For this discussion, we will ignore reductions.

A transformation is a category of computation where the output data set size (number of elements) is either "large" or "approximately the same" as the input data set size. For example, taking an image and producing a blurred image would be a transformation.

For transformations, a common approach ("thread strategy") to writing a cuda kernel (the thread code) will be to make one unique thread responsible for each point in the output array. Therefore, the total minimum number of threads that I must have is equal to the size of my output array. The thread code is just the set of computations needed on the input data, in order to produce one output data point. Roughly speaking then, your problem, and simplified kernel, fit this definition; it is a transformation.

Following the above thread strategy, we will need a total number of threads in our grid equal to the total number of output points I need to create. For 2D problems, it is often convenient to think about these two-dimensionally, and CUDA provides 2D (or 3D) threadblock organization and 2D (or 3D) grid organization, for this purpose.

Choice of CUDA threadblock dimensions is often somewhat arbitrary. Generally speaking, we typically want to aim for threadblocks in the 128 - 512 threads per block range (for reasons that are covered elsewhere) and we want threadblocks that are whole-number multiples of 32 (the warp size) for efficiency when the threadblock gets subdivided into warps, which are the actual unit of CUDA execution. On currently supported GPUs, threadblocks are limited to 1024 threads per block (total - i.e. the product of the dimensions). However, for many problems, threadblock choices within this range (e.g. 256 threads vs. 512 threads) often have relatively little impact on performance. In the interest of getting something working, we don't sweat the details at this point. (When you're coming back for optimization, you may revisit this choice.)

So far we've learned that for this problem type, we need a total number of threads to cover our problem space, and we will have a somewhat arbitrary threadblock dimension choice. So let's choose (32,16) (x,y) to start with, for a total of 512 threads. There are no rules that state that theadblocks need be "square", or that grids need be "square", or that there should even be any sort of ratiometric parity between threadblock dimensions and problem size (or grid dimensions.)

Now that we have a threadblock choice of (32,16) in mind, we must ask ourselves "how many of these do I need?". This problem is 2D and so we've chosen a 2D threadblock for simplicity of index generation in the thread code. Let's choose a 2D grid as well - it makes sense for a 2D problem, and again for 2D simplicity of index generation. So we can consider the two dimensions independently.

So, how many blocks do I need in the x-direction? I need at least as many as (my problem size in x)/(my threadblock size in x). Since we are dealing with all integers here, this begs the question "what if my problem size is not evenly divisible by my threadblock size?" The canonical solution is to launch more than enough threads to cover the space, or enough blocks to cover the space. But in the non-evenly-divisible case, this will result in "extra threads". We'll discuss and deal with these shortly. Therefore, if I have a dim3 variable like this for threadblock dimensions:

    #define BX 32
    #define BY 16   
    ...
    dim3 block(BX,BY);

then I might construct my dim3 grid variable like this:

    #define DX 800
    #define DY 6500
    ...
    dim3 grid((DX+block.x-1)/block.x, (DY+block.y-1)/block.y);

If you work through this arithmetic, you will see that this causes us to launch enough blocks in the x and y direction, so that we will have at least enough threads to cover our problem space of (DX,DY), one thread per output point.

Hopefully it is clear that the Y dimension is treated separately and independently from the x-dimension.

The above calculations will usually result in the generation of "too many" threads in my grid. I will have some "extra threads" beyond the end of my problem space (DX, DY) that I need to handle. We want these threads to "do nothing". The canonical way to handle this, is to pass the problem space dimensions to my kernel, create an appropriate globally unique thread index in my kernel, then compare that index to the maximum index in my problem space. If it exceeds it, we simply have that thread skip all remaining thread code.

Using your kernel as an example, it might look like this:

__global__ void prop_mtx( float * w, float * i, float * o, int s, const size_t d_size_x, const size_t d_size_y ) 
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; 
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ((x < d_size_x) && (y < d_size_y))  // thread check
      o[y + x * s] = w[x] * i[y];
}

Note that such a thread check will create threads (in some blocks) that are "not participating" in the subsequent code. A point to be aware of here is that the usage of __syncthreads() depends on all threads in a block participating. Therefore, we should not use __syncthreads() directly in such a case. Instead, we have to condition threadblock behavior appropriately:

__global__ void prop_mtx( float * w, float * i, float * o, int s, const size_t d_size_x, const size_t d_size_y ) 
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; 
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ((x < d_size_x) && (y < d_size_y))  // thread check
      {
         o[y + x * s] = w[x] * i[y];
         // and other code not dependent on __syncthreads()
       }
     // now it is safe to use since all threads are participating
     __syncthreads();
    if ((x < d_size_x) && (y < d_size_y))  // thread check
      {
          // rest of kernel code
       }
}

Note that it is possible to have a smaller number of threads perform the necessary computations for a larger number of output data points. The 1:1 correspondence between threads and output data is an easy way to think about and write the cuda kernel code, but it's not the only way. One other possible method would be to use some form of a grid-striding loop, so that a smaller grid can cover a larger problem space. Discussion of those strategies is outside the scope of this answer, and the basic methodology discussed in this answer should be understood before tackling other approaches.