We all know that a GPGPU has several streaming multiprocessors (SMs), and that each SM contains many stream processors (SPs), when we talk about its hardware architecture. But NVIDIA's CUDA programming model introduces two other concepts: block and thread.
We also know that a block corresponds to an SM and a thread corresponds to an SP, and that when we launch a CUDA kernel, we configure it as kernel<<<blockNum, threadNum>>>. I have been writing CUDA programs like this for nearly two months, but I still have a lot of questions. A good programmer is never satisfied with a program that merely has no bugs; they want to delve inside and know how the program behaves.

I have the following questions:
Suppose a GPU has 14 SMs and each SM has 48 SPs, and we have a kernel like this:

__global__ void doubleData(int *data, int dataNum)   // "double" is a reserved word in C/C++, so the kernel needs another name
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < dataNum) {
        data[tid] *= 2;                  // double each element
        tid += blockDim.x * gridDim.x;   // grid-stride: jump ahead by the total number of threads in the grid
    }
}
and data is an array of 1024 * 1024 ints, with the kernel configured as <<<128, 512>>>. That means the grid has 512 * 128 threads, and every thread will iterate (1024 * 1024) / (512 * 128) = 16 times in its while loop. But there are only 14 * 48 SPs, which means that only 14 * 48 threads can actually run simultaneously, no matter how many blocks or threads appear in your configuration. So what is the point of blockNum and threadNum in the configuration? Why not just use <<<number of SMs, number of SPs>>>?
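For concreteness, this is roughly how I invoke it from the host (a minimal sketch using the numbers above; doubleData is the kernel defined earlier, and error checking and the initial fill of the array are omitted):

#include <cuda_runtime.h>

// Minimal host-side driver for the kernel above (assumes doubleData is in scope).
int main()
{
    const int dataNum = 1024 * 1024;
    int *d_data;
    cudaMalloc(&d_data, dataNum * sizeof(int));   // device array of 1024 * 1024 ints

    // 128 blocks * 512 threads = 65536 threads in the grid;
    // each thread therefore loops dataNum / 65536 = 16 times.
    doubleData<<<128, 512>>>(d_data, dataNum);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}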
And is there any difference between <<<128, 512>>> and <<<64, 512>>>? Perhaps the former will iterate 16 times in its while loop and the latter 32 times, but the former has twice as many blocks to schedule. Is there any way to know the best configuration, rather than just comparing results and choosing the best? We cannot try every combination, so the winner would not be the true best, only the best among our attempts.

We know only one block can run on an SM at a time, but where does CUDA store the other blocks' contexts? Suppose there are 512 blocks and 14 SMs: only 14 blocks have their contexts in the SMs, so what about the other 498 blocks' contexts?
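As a side note about the hardware numbers I keep quoting: the SM count can be queried at runtime, while the SP count per SM is not reported directly and has to be derived from the compute capability. A minimal sketch using cudaGetDeviceProperties:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("SMs (multiProcessorCount): %d\n", prop.multiProcessorCount);
    printf("Compute capability:        %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}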