Let me start by saying that I've carefully read all the similar questions on SO:
- Determining threads per block and block per grid
- Threads per SM, threads per block
- CUDA Blocks and Threads
- Warps and optimal number of blocks
My intention is to calculate the grid and block dimensions dynamically (rather than hardcoding values) for a feed-forward neural-net library I am developing.
My data is not a square lattice (a matrix), as in most examples I've seen; it is instead two vectors producing a matrix, with unequal rows and columns:
float x[6] {1.f, 1.f, 0.f, 1.f, 1.f, 0.f};
thrust::device_vector<float> in_vec( x, x+6 );
float y[9] {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f};
thrust::device_vector<float> w_vec( y, y+9 );
thrust::device_vector<float> o_vec(9);
thrust::device_vector<float> mtx_vec( 9 * 6 );
float * i_ptr = thrust::raw_pointer_cast( in_vec.data() );
float * w_ptr = thrust::raw_pointer_cast( w_vec.data() );
float * out_ptr = thrust::raw_pointer_cast( mtx_vec.data() );
dim3 threadsPerBlock(9,6);
dim3 numBlocks(1,1);
prop_mtx<<<numBlocks,threadsPerBlock>>>( w_ptr, i_ptr, out_ptr, 6 );
and the kernel:
__global__ void prop_mtx( float * w, float * i, float * o, int s )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // weight index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // input index
    o[y + x * s] = w[x] * i[y];                     // one outer-product entry
}
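As an aside: if the grid ever has to be rounded up to warp multiples (which I suspect is the case), I assume the kernel would need a bounds guard so the padding threads do nothing. This is my own untested sketch, with the vector lengths passed in explicitly:

```cuda
__global__ void prop_mtx_guarded( float * w, float * i, float * o,
                                  int w_len, int i_len )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // weight index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // input index
    if ( x < w_len && y < i_len )                   // skip padding threads
        o[y + x * i_len] = w[x] * i[y];
}
```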
The reason I've taken this approach is that it makes sense for ANN computation when it comes to vector/matrix calculations. I'd like to keep this consistent, and AFAIK using a 2D grid for Weight * Input calculations is reasonable. So I have to compute my threads per block as a 2D grid, with unequal numbers of threads along each dimension.
I am using a GTX 660, which has:
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2047 MBytes
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
I am trying to understand how I can deduce/compute the grid size, threads per block, and number of blocks.
Let us assume I have a weight vector of 800 items, and an input vector of 6500 items.
- Does this imply that what I really need is a 2D grid of 800 × 6500 threads? As far as I understand, anything else will produce incorrect results?
I know my maximum threads per block is 1024, but because it's a 2D grid, it would more likely be:
dim3 threadsPerBlock(X,Y);
Since my grid is not a square matrix, do I need to calculate the X and Y threads per block differently?
Or do I need to deduce the number of blocks first?
Finally, since my warp size is 32:
- Does the minimum grid size, regardless of all other parameters, need to be at least 32, or a multiple of 32? That is, do I need at least 32 threads per block, or a grid size whose smallest dimension is 32?
Any pseudo-code, or explanation of how I should go about this, would be greatly appreciated.
What I have tried is to calculate my 2D grid size by dividing my data by the warp size of 32. Then I considered calculating the threads in the grid by using the available SMs. For example:
800 weights / 5 SM, = 160 x's per SM
6500 inputs / 5 SM, = 1300 y's per SM
But I didn't know where to go from there. Finally, I considered finding the input-to-weight ratio first:
6500/800 = 8.125
Implying that, if I take the minimum of 32 for X, then Y would have to be 8.125 * 32 = 260. Hence, my threadsPerBlock would be:
dim3 threadsPerBlock(32,260);
That is, of course, 8320 threads per block, which far exceeds the 1024-per-block limit.
So this is my issue: how do I not exceed the 1024 threads per block, whilst retaining the correct grid size of my data?
PS: My question is not about optimising the code, but about understanding how to distribute the threads and grid data over the device.