cuda threads and blocks

Question

I posted this on the NVIDIA forums, I thought I would get a few more eyes to help.

I'm having trouble trying to expand my code out to perform with multiple cases. I have been developing with the most common case in mind, now its time for testing and i need to ensure that it all works for the different cases. Currently my kernel is executed within a loop (there are reasons why we aren't doing one kernel call to do the whole thing.) to calculate a value across the row of a matrix. The most common case is 512 columns by 512 rows. I need to consider matricies of the size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a rather simple kernel call:

launchKernel<<<1,512>>>(................)

This kernel works fine for the common 512x512 and 512 x 1024 (column, row respectively) case, but not for the 1024 x 512 case. This case requires 1024 threads to execute. In my naivety i have been trying different versions of the simple kernel call to launch 1024 threads.

launchKernel<<<2,512>>>(................)  // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???

I beleive my problem has something to do with my lack of understanding of the threads and blocks

Here is the output of deviceQuery, as you can see i can have a max of 1024 threads

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla C2050"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2688 MBytes (2818572288 bytes)
  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores
  GPU Clock Speed:                               1.15 GHz
  Memory Clock rate:                             1500.00 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                Yes
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           40 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro 600"
  CUDA Driver Version / Runtime Version          4.2 / 4.1
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 2) Multiprocessors x (48) CUDA Cores/MP:     96 CUDA Cores
  GPU Clock Speed:                               1.28 GHz
  Memory Clock rate:                             800.00 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 131072 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           15 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600

I am using only the Tesla C2050 device Here is a stripped out version of my kernel, so you have an idea of what it is doing.

#define twoPi               6.283185307179586
#define speed_of_light      3.0E8
#define MaxSize             999

__global__ void calcRx4CPP4
(  
        const float *array1,  
        const double *array2,  
        const float scalar1,  
        const float scalar2,  
        const float scalar3,  
        const float scalar4,  
        const float scalar5,  
        const float scalar6,  
        const int scalar7,  
        const int scalar8,    
        float *outputArray1,
        float *outputArray2)  
{  

    float scalar9;  
    int idx;  
    double scalar10;
    double scalar11;  
    float sumReal, sumImag;  
    float real, imag;  

    float coeff1, coeff2, coeff3, coeff4;  

    sumReal = 0.0;  
    sumImag = 0.0;  

    // kk loop 1 .. 512 (scalar7)  
    idx = (blockIdx.x * blockDim.x) + threadIdx.x;  

    /* Declare the shared memory parameters */
    __shared__ float SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    /* populate the arrays on shared memory */
    SharedArray1[idx] = array1[idx];  // first 512 elements
    SharedArray2[idx] = array2[idx];
    if (idx+blockDim.x < MaxSize){
        SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
        SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
    }            
    __syncthreads();

    // input scalars used here.
    scalar10 = ...;
    scalar11 = ...;

    for (int kk = 0; kk < scalar8; kk++)
    {  
        /* some calculations */
        // SharedArray1, SharedArray2 and scalar9 used here
        sumReal = ...;
        sumImag = ...;
    }  


    /* calculation of the exponential of a complex number */
    real = ...;
    imag = ...;
    coeff1 = (sumReal * real);  
    coeff2 = (sumReal * imag);  
    coeff3 = (sumImag * real);  
    coeff4 = (sumImag * imag);  

    outputArray1[idx] = (coeff1 - coeff4);  
    outputArray2[idx] = (coeff2 + coeff3);  


}

Because my max threads per block is 1024, I thought I would be able to continue to use the simple kernel launch, am I wrong?

How do I successfully launch each kernel with 1024 threads?

what is the actual problem? what is not working? if you have 1 block and 1024 threads you need a shared array of size 1024 and not MaxSize = 999 with your kind of indexing. — djmj
MaxSize = 999, isn't related to the threads, its just hte size of the arrays being copied to shared memory. Each thread must loop over this entire array to get a sum (simplification of the actual algorithm). The problem is that it wont work when i try to use 1024 threads in the case of 512 columns and 1024 rows. The resulting outputArray1/2 are not being completly filled up, so all 1024 threads are not being executed. — Beau Bellamy
I have no idea what your kernel does, but if you use 1024 threads and write to SM using "SharedArray1[idx] = array1[idx];" you will definetly write out of range since size of array is 999 and idx can be of range [0-1023] — djmj

Roger Dahl Roger Dahl · Accepted Answer · 2012-05-04T03:20:46

You don't want to vary the number of threads per block. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator. After you have that number, you simply launch the number of blocks that are required to get the total number of threads that you need. If the number of threads that you need for a given case is not always a multiple of the threads per block, you add code in the top of your kernel to abort the unneeded threads. (if () return;). Then, you pass in the dimensions of your matrix either with extra parameters to the kernel or by using x and y grid dimensions, depending on which information is required in your kernel (I haven't studied it).

My guess is that the reason you're having trouble with 1024 threads is that, even though your GPU supports that many threads in a block, there is another limiting factor to the number of threads you can have in each block based on resource usage in your kernel. The limiting factor can be shared memory or register usage. The Occupancy Calculator will tell you which, though that information is only important if you want to optimize your kernel.

cuda threads and blocks

2 Answers