Using CUDA Shared Memory to Improve Global Access Patterns

Question

I have the following kernel to get the magnitude of a bunch of vectors:

__global__ void norm_v1(double *in, double *out, int n)
{
    const uint i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
    {
        double x = in[3*i], y = in[3*i+1], z = in[3*i+2];
        out[i] = sqrt(x*x + y*y + z*z);
    }
}

However due to the packing of in as [x0,y0,z0,...,xn,yn,zn] it performs poorly with the profiler indicating a 32% global load efficiency. Repacking the data as [x0, x1, ..., xn, y0, y1, ..., yn, z0, z1, ..., zn] improves things greatly (with the offsets for x, y, and z changing accordingly). Runtime is down and efficiency is up to 100%.

However, this packing is simply not practical for my application. I therefore wish to investigate the use of shared memory. My idea is for each thread in a block to copy three values (blockDim.x apart) from global memory -- yielding coalesced access. Under the assumption of a maximum blockDim.x = 256 I came up with:

#define BLOCKDIM 256

__global__ void norm_v2(double *in, double *out, int n)
{
    __shared__ double invec[3*BLOCKDIM];

    const uint i = blockIdx.x * blockDim.x + threadIdx.x;

    invec[0*BLOCKDIM + threadIdx.x] = in[0*BLOCKDIM+i];
    invec[1*BLOCKDIM + threadIdx.x] = in[1*BLOCKDIM+i];
    invec[2*BLOCKDIM + threadIdx.x] = in[2*BLOCKDIM+i];
    __syncthreads();

    if (i < n)
    {
        double x = invec[3*threadIdx.x];
        double y = invec[3*threadIdx.x+1];
        double z = invec[3*threadIdx.x+2];

        out[i] = sqrt(x*x + y*y + z*z);
    }
}

However this is clearly deficient when n % blockDim.x != 0, requires knowing the maximum blockDim in advance and generates incorrect results for out[i > 255] when tested with an n = 1024. How should I best remedy this?

CygnusX1 CygnusX1 · Accepted Answer · 2012-11-06T04:19:56

I think this can solve the out[i > 255] problem:

__shared__ double shIn[3*BLOCKDIM];

const uint blockStart = blockIdx.x * blockDim.x;

invec[0*blockDim.x+threadIdx.x] = in[ blockStart*3 + 0*blockDim.x + threadIdx.x];
invec[1*blockDim.x+threadIdx.x] = in[ blockStart*3 + 1*blockDim.x + threadIdx.x];
invec[2*blockDim.x+threadIdx.x] = in[ blockStart*3 + 2*blockDim.x + threadIdx.x];
__syncthreads();

double x = shIn[3*threadIdx.x];
double y = shIn[3*threadIdx.x+1];
double z = shIn[3*threadIdx.x+2];

out[blockStart+threadIdx.x] = sqrt(x*x + y*y + z*z);

As for n % blockDim.x != 0 I would suggest padding the input/output arrays with 0 to match the requirement.

If you dislike the BLOCKDIM macro - explore using extern __shared__ shArr[] and then passing 3rd parameter to kernel configuration:

norm_v2<<<gridSize,blockSize,dynShMem>>>(...)

the dynShMem is the dynamic shared memory usage (in bytes). This is extra shared memory pool with its size specified at run-time, where all extern __shared__ variables will be initially assigned to.

What GPU are you using? Fermi or Kepler might help your original code with their L1 caching.

If you don't want to pad your in array, or you end up doing similar trick somewhere else, you may want to consider implementing a device-side memcopy, something like this:

template <typename T>
void memCopy(T* destination, T* source, size_t numElements) {
    //assuming sizeof(T) is a multiple of sizeof(int)
    //assuming one-dimentional kernel (only threadIdx.x and blockDim.x matters) 
    size_t totalSize = numElements*sizeof(T)/sizeof(int);
    int* intDest = (int*)destination;
    int* intSrc = (int*)source;
    for (size_t i = threadIdx.x; i < totalSize; i += blockDim.x) {
        intDest[i] = intSrc[i];
    }
    __syncthreads();
}

It basically treats any array as an array of int-s and copy the data from one location to another. You may want to replace the underlying int type with double-s or long long int if you work with 64-bit types only.

Then you can replace the copying lines with:

memCopy(invec, in+blockStart*3, min(blockDim.x, n-blockStart));

Using CUDA Shared Memory to Improve Global Access Patterns

1 Answers