Understanding and optimising threads, blocks, and grids in pyCUDA

Question

I'm very new to GPU programming and pyCUDA and have a pretty fundamental gap in my knowledge. I have spent quite a bit of time searching SO, looking at example code and reading supporting documentation for CUDA/pyCUDA but haven't found much diversity in the explanations and can't get my head around a few things.

I am having trouble correctly defining block and grid dimensions. The code I am currently running is as follows, and aims to do element-wise multiplication of an array a by a float b:

from __future__ import division
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

rows = 256
cols = 10
a = np.ones((rows, cols), dtype=np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

b = np.float32(2)

mod = SourceModule("""
  __global__ void MatMult(float *a, float b)
  {
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    const int j = threadIdx.y + blockDim.y * blockIdx.y;
    int Idx = i + j*gridDim.x;
    a[Idx] *= b;
  }
  """)

func = mod.get_function("MatMult")

xBlock = np.int32(np.floor(1024/rows))
yBlock = np.int32(cols)

bdim = (xBlock, yBlock, 1)
dx, mx = divmod(rows, bdim[0])
dy, my = divmod(cols, bdim[1])
gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1])
print "bdim=",bdim, ", gdim=", gdim

func(a_gpu, b, block=bdim, grid=gdim)
a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2*a

The code should print the block dimensions bdim and the grid dimensions gdim, as well as an array of zeroes.

This works for small array sizes, for example, if rows=256 and cols=10 (as in the example above) the output is as follows:

bdim= (4, 10, 1) , gdim= (256, 10)
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

However, if I increase rows=512, I get the following output:

bdim= (2, 10, 1) , gdim= (512, 10)
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 2.  2.  2. ...,  2.  2.  2.]
 [ 2.  2.  2. ...,  2.  2.  2.]
 [ 2.  2.  2. ...,  2.  2.  2.]]

Indicating that the multiplication is happening twice for some elements of the array.

However, if I force the block dimensions to bdim = (1,1,1), the problem no longer occurs and I get the following (correct) output for the larger array size:

bdim= (1, 1, 1) , gdim= (512, 10)
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

I don't understand this. What is happening here which means that this method of defining the block and grid dimensions is no longer appropriate as the array size is increased? Also, if block has dimensions (1,1,1) does this mean that the calculation is being performed serially?

Thanks in advance for any pointers and help!

pSoLT pSoLT · Accepted Answer · 2017-01-20T07:52:04

You operate on 2D grid of 2D blocks. In your kernel you seem to assume that gridDim.x would return number of threads in x dimension of a grid.

__global__ void MatMult(float *a, float b)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    const int j = threadIdx.y + blockDim.y * blockIdx.y;
    int Idx = i + j*gridDim.x;
    a[Idx] *= b;
}

The gridDim.x returns number of blocks r x direction of grid, not number of threads. In order to obtain number of threads in given direction you should multiply number of threads in a block with number of blocks in a grid in the same direction:

int Idx = i + j * blockDim.x * gridDim.x

Understanding and optimising threads, blocks, and grids in pyCUDA

1 Answers