4
votes

We all know that a GPGPU has several streaming multiprocessors (SMs), each containing many streaming processors (SPs), when we talk about its hardware architecture. But NVIDIA's CUDA programming model introduces two other concepts: the block and the thread.

And we also know that a block corresponds to an SM and a thread corresponds to an SP. When we launch a CUDA kernel, we configure it as kernel<<<blockNum, threadNum>>>. I have been writing CUDA programs like this for nearly two months, but I still have a lot of questions. A good programmer is never satisfied with a merely bug-free program; they want to dig inside and understand how the program actually behaves.

I have the following questions:

  1. Suppose a GPU has 14 SMs, each with 48 SPs, and we have a kernel like this:

    // Note: "double" is a reserved C++ keyword, so the kernel is renamed doubleData.
    __global__ void doubleData(int *data, int dataNum){
        unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while(tid < dataNum){
            data[tid] *= 2;
            // Grid-stride loop: advance by the total number of launched threads.
            tid += blockDim.x * gridDim.x;
        }
    }
    

    and data is an array of 1024 * 1024 ints, with the kernel configured as <<<128, 512>>>. That means the grid has 512 * 128 threads, and every thread will iterate (1024 * 1024)/(512 * 128) = 16 times in its while loop. But there are only 14 * 48 SPs, which means that only 14 * 48 threads can run simultaneously no matter how many blocks or threads you configure. So what is the point of blockNum and threadNum in the configuration? Why not just use <<<number of SMs, number of SPs>>>?

  2. And is there any difference between <<<128, 512>>> and <<<64, 512>>>? Perhaps the former will iterate 16 times in its while loop and the latter 32 times, but the former has twice as many blocks to schedule. Is there any way to know what the best configuration is, other than just comparing results and choosing the best? We cannot try every combination, so the result is never the true best, only the best among our attempts.

  3. We know that only one block can run on an SM at a time, but where does CUDA store the other blocks' contexts? Suppose 512 blocks and 14 SMs: only 14 blocks have their contexts on the SMs, so what about the other 498 blocks' contexts?


1 Answer

2
votes

> And we also know that a block corresponds to an SM and a thread corresponds to an SP

This is incorrect. An SM can process multiple blocks simultaneously and an SP can process multiple threads simultaneously.

1) I think your question may come from not separating the work that an application needs to have done from the resources available to do that work. When you launch a kernel, you specify the work you want done. The GPU then uses its resources to perform that work. The more resources a GPU has, the more work it can do in parallel. Any work that cannot be done in parallel is done serially.

By letting the programmer specify the work that needs to be done without tying it to the amount of resources available on a given GPU, CUDA provides an abstraction that allows the app to seamlessly scale to any GPU.
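For example, the host side typically sizes the launch from the problem, not from the SM or SP counts. A minimal sketch using the doubleData kernel and dataNum from the question (the d_data device pointer is assumed to have been allocated and copied to already):

    // Size the launch from the amount of work, not from the GPU's resources.
    const int threadsPerBlock = 512;  // a multiple of the warp size (32)
    const int dataNum = 1024 * 1024;
    // Round up so every element is covered by at least one thread.
    const int blocks = (dataNum + threadsPerBlock - 1) / threadsPerBlock;
    doubleData<<<blocks, threadsPerBlock>>>(d_data, dataNum);

The same launch code then runs unchanged on a GPU with 2 SMs or 100 SMs; only how much of the grid executes in parallel changes.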

> But there are only 14 * 48 SPs, which means that only 14 * 48 threads can run simultaneously no matter how many blocks or threads you configure

SPs are pipelined, so they process many threads simultaneously, each at a different stage of completion. Each SP can start one operation and yield the result of one operation per clock. But as you can see now, even if your statement were correct, it would not lead to your conclusion.

2) Threads in a block can cooperate with each other using shared memory. If your app is not using shared memory, the only implication of block size is performance. Initially, you can get a good value for the block size from the occupancy calculator. After that, you can fine-tune the block size for performance by testing different sizes. Since threads run in groups of 32, called warps, you want the block size to be divisible by 32, so there are not that many block sizes to test.
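Newer CUDA runtimes also expose the occupancy calculation programmatically via cudaOccupancyMaxPotentialBlockSize (CUDA 6.5+). A minimal sketch, assuming the doubleData kernel from the question:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void doubleData(int *data, int dataNum){
        unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while(tid < dataNum){
            data[tid] *= 2;
            tid += blockDim.x * gridDim.x;
        }
    }

    int main(){
        int minGridSize = 0, blockSize = 0;
        // Ask the runtime for the block size that maximizes occupancy for
        // this kernel on the current device; use it as a tuning starting point.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, doubleData, 0, 0);
        printf("suggested block size: %d\n", blockSize);
        return 0;
    }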

3) An SM can run a number of blocks at the same time. The number of blocks depends on how many resources each block requires and how many resources the SM has. A block uses a number of different resources and one of the resources becomes the limiting factor in how many blocks will run simultaneously. The occupancy calculator will tell you what the limiting factor is.
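For concreteness, here is a sketch that queries the per-SM resource pools through cudaGetDeviceProperties; whichever pool a block's requirements exhaust first caps how many blocks can be resident at once:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(){
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0
        printf("SMs:                  %d\n", prop.multiProcessorCount);
        printf("threads per SM:       %d\n", prop.maxThreadsPerMultiProcessor);
        printf("registers per SM:     %d\n", prop.regsPerMultiprocessor);
        printf("shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }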

Only blocks that are running simultaneously consume resources on an SM. I think those resources are what you mean by context. Blocks that have completed and blocks that have not yet started do not consume resources on an SM.