
I'm profiling the following CUDA kernel

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        // +1 for even i+j, -1 for odd: the checkerboard sign flip of fftshift
        double a = pow(-1.0, (i+j)&1);

        data[j*blockDim.x*gridDim.x+i].x *= a;
        data[j*blockDim.x*gridDim.x+i].y *= a;
    }
}

which multiplies each element of a 2D double-precision complex matrix by a double-precision scalar (here ±1, the checkerboard sign flip used to implement fftshift).

As can be seen, I'm performing coalesced global memory accesses, and I want to verify this with the NVIDIA Visual Profiler by inspecting the global memory load and store efficiencies. Surprisingly, both efficiencies turn out to be exactly 50%, far from the expected 100% for coalesced memory access. Is this related to the interleaved storage of real and imaginary parts for complex numbers? If so, is there any trick I could exploit to restore 100% efficiency?

Thank you in advance.

ADDITIONAL INFORMATION

BLOCK_SIZE_x=16
BLOCK_SIZE_y=16

dim3 dimBlock2(BLOCK_SIZE_x,BLOCK_SIZE_y);
dim3 dimGrid2(N2/BLOCK_SIZE_x + (N2%BLOCK_SIZE_x == 0 ? 0 : 1),
              N1/BLOCK_SIZE_y + (N1%BLOCK_SIZE_y == 0 ? 0 : 1));
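
For completeness, the kernel is then invoked as follows (a minimal sketch; d_data is just an illustrative name for the device array of N1*N2 double2 elements):

fftshift_2D<<<dimGrid2, dimBlock2>>>(d_data, N1, N2);
cudaDeviceSynchronize(); // make sure the profiler sees the completed kernel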

N1 and N2 can be arbitrary even numbers.

The card is an NVIDIA GT 540M.

Comments:

More info needed. N1=? N2=? blockDim=? gridDim=? j*blockDim.x*gridDim.x+i looks odd. Could it be something like j*N1+i? (kangshiyin)

@talonmies I have added additional information in my revised post. (Vitality)

@EricShiyinKang I have added additional information in my revised post. I believe that the data indexing is fine. (Vitality)

@JackOLantern: Is this a Fermi or Kepler GPU? (talonmies)

@talonmies Provided in the revised post. (Vitality)

2 Answers

5 votes

Take a look at this NVIDIA blog post about the efficiency of various memory access patterns. You're hitting the strided memory access problem.

Since each component is used independently, you could treat your double2 array as a plain double array instead (just as Robert Crovella suggested).

__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    // i now indexes individual doubles, i.e. 2*N1 of them per row
    if (i < N1 * 2 && j < N2) {
        // consecutive pairs (real, imaginary) get the same sign
        double a = pow(-1.0, (i / 2 + j) & 1);
        data[j * blockDim.x * gridDim.x + i] *= a;
    }
}
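
Note that this version needs the grid to span 2 * N1 threads along x instead of N1, so the launch configuration has to change accordingly. A minimal sketch, reusing the question's BLOCK_SIZE_x and BLOCK_SIZE_y (the question's grid maps x to N2; here I assume x spans the row of 2 * N1 doubles, to match the guard above):

dim3 dimBlock(BLOCK_SIZE_x, BLOCK_SIZE_y);
dim3 dimGrid((2 * N1 + BLOCK_SIZE_x - 1) / BLOCK_SIZE_x,
             (N2 + BLOCK_SIZE_y - 1) / BLOCK_SIZE_y);
fftshift_2D<<<dimGrid, dimBlock>>>((double *)data, N1, N2);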

But if you ever need to access both the x and y components in a single thread, you could try:

Using two separate arrays, one with the x component and one with the y component (see below for one way to split an existing double2 array), like this:

__global__ void fftshift_2D(double *dataX, double *dataY, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i+j)&1);

        // each array is accessed with unit stride, so both are coalesced
        dataX[j*blockDim.x*gridDim.x+i] *= a;
        dataY[j*blockDim.x*gridDim.x+i] *= a;
    }
}
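
If the data already exists in interleaved double2 form, one way to split it is a pair of strided cudaMemcpy2D calls (a sketch; dataX and dataY are assumed to be separate device allocations of N1 * N2 doubles each):

size_t n = (size_t)N1 * N2;
// copy the .x components: one 8-byte "row" per element, 16-byte source pitch
cudaMemcpy2D(dataX, sizeof(double), data, sizeof(double2),
             sizeof(double), n, cudaMemcpyDeviceToDevice);
// copy the .y components: same, but starting 8 bytes into the source
cudaMemcpy2D(dataY, sizeof(double), (char *)data + sizeof(double), sizeof(double2),
             sizeof(double), n, cudaMemcpyDeviceToDevice);

Such an element-by-element copy is not fast, so it only pays off if the split arrays are reused many times; producing the data in the two-array layout in the first place is cleaner.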

Or leaving the data layout as is, but loading it without a stride into shared memory and reshuffling it there. That would look more or less like this:

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    // one slice of 2*BLOCK_SIZE doubles per row of threads in the block,
    // so rows with different threadIdx.y do not overwrite each other
    // (assumes a square BLOCK_SIZE x BLOCK_SIZE block, as in the question)
    __shared__ double buff[BLOCK_SIZE][BLOCK_SIZE * 2];
    double2 *buff2 = (double2 *) buff[threadIdx.y];
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    double *ptr = (double *) &data[j * blockDim.x * gridDim.x + blockDim.x * blockIdx.x];

    // TODO add guarding with N1 & N2
    // coalesced loads: consecutive threads read consecutive doubles
    buff[threadIdx.y][threadIdx.x] = ptr[threadIdx.x];
    buff[threadIdx.y][blockDim.x + threadIdx.x] = ptr[blockDim.x + threadIdx.x];
    __syncthreads();

    // reshuffled access: each thread now owns one double2 in shared memory
    double a = pow(-1.0, (i + j) & 1);
    buff2[threadIdx.x].x *= a;
    buff2[threadIdx.x].y *= a;

    __syncthreads();
    // coalesced stores back to global memory
    ptr[threadIdx.x] = buff[threadIdx.y][threadIdx.x];
    ptr[blockDim.x + threadIdx.x] = buff[threadIdx.y][blockDim.x + threadIdx.x];
}
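
Note the two-dimensional shared buffer: with a 16x16 block, each row of threads (each threadIdx.y) gets its own slice of 2*BLOCK_SIZE doubles, so rows do not overwrite each other, while the global loads and stores stay coalesced because consecutive threads touch consecutive doubles.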

4 votes

Yes, since you have an array-of-structures (AoS) data storage format, and you are referencing only every other element with the following line:

    data[j*blockDim.x*gridDim.x+i].x *= a;

then the global load and the global store that occur as a result will each have only 50% utilization. Note that I think the cache should help here, since you're referencing the alternate elements on the following line, but the load/store efficiency is still 50%.

I believe you could work around this (for this particular example) by recasting *data:

double *mydata = (double *)data;  // reinterpret the double2 array as a flat double array
...
mydata[2*(j*blockDim.x*gridDim.x)+i] *= a;

Please note that I'm not trying to show exactly how to get the same coverage, just to illustrate the idea. The code above is approximately what is needed, but you will have to tweak it to make sure all the elements you want multiplied are handled correctly.
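
For what it's worth, here is a tweaked sketch of that idea, assuming the launch is changed so that the grid's x dimension covers 2 * N1 threads per row (the kernel name and the coverage choice are mine, not a tested implementation):

__global__ void fftshift_2D_recast(double *mydata, int N1, int N2)
{
    // i indexes individual doubles, so it runs over 2*N1 per row
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < 2 * N1 && j < N2) {
        // the real and imaginary halves of one element share the same sign
        double a = pow(-1.0, ((i / 2) + j) & 1);
        // row pitch kept as blockDim.x*gridDim.x, matching the question's indexing
        mydata[j * blockDim.x * gridDim.x + i] *= a;
    }
}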