copy to the shared memory in cuda

Question

In CUDA programming, if we want to use shared memory, we need to bring the data from global memory to shared memory. Threads are used for transferring such data.

I read somewhere (in online resources) that it is better not to involve all the threads in the block for copying data from global memory to shared memory. Such idea makes sense that all the threads are not executed together. Threads in a warp execute together. But my concern is all the warps are not executed sequentially. Say, a block with threads is divided into 3 warps: war p0 (0-31 threads), warp 1 (32-63 threads), warp 2 (64-95 threads). It is not guaranteed that warp 0 will be executed first (am I right?).

So which threads should I use to copy the data from global to shared memory?

talonmies talonmies · Accepted Answer · 2013-03-18T07:03:32

To use a single warp to load a shared memory array, just do something like this:

__global__
void kernel(float *in_data)
{
    __shared__ float buffer[1024];

    if (threadIdx.x < warpSize) {
        for(int i = threadIdx; i  <1024; i += warpSize) {
            buffer[i] = in_data[i];
        }
    }
    __syncthreads();

    // rest of kernel follows
}

[disclaimer: written in browser, never tested, use at own risk]

The key point here is the use of __syncthreads() to ensure that all threads in the block wait until the warp performing the load to shared memory have finished the load. The code I posted used the first warp, but you can calculate a warp number by dividing the thread index within the block by the warpSize. I also assumed a one-dimensional block, it is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise to the reader.

copy to the shared memory in cuda

2 Answers