
I'm profiling the following CUDA kernel

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        // +1 for even i+j, -1 for odd: the checkerboard sign flip of fftshift
        double a = pow(-1.0, (i+j)&1);

        data[j*blockDim.x*gridDim.x+i].x *= a;
        data[j*blockDim.x*gridDim.x+i].y *= a;
    }
}

which multiplies each element of a 2D double-precision complex matrix by a double-precision scalar (here ±1, the checkerboard sign flip used to implement fftshift).

As can be seen, I'm performing coalesced global memory accesses, and I want to verify this with the NVIDIA Visual Profiler by inspecting the global memory load and store efficiencies. Surprisingly, both efficiencies turn out to be exactly 50%, far from the expected 100% for coalesced memory access. Is this related to the interleaved storage of real and imaginary parts for complex numbers? If so, is there any trick I could exploit to restore 100% efficiency?

Thank you in advance.

ADDITIONAL INFORMATION

BLOCK_SIZE_x=16
BLOCK_SIZE_y=16

dim3 dimBlock2(BLOCK_SIZE_x,BLOCK_SIZE_y);
dim3 dimGrid2(N2/BLOCK_SIZE_x + (N2%BLOCK_SIZE_x == 0 ? 0 : 1),
              N1/BLOCK_SIZE_y + (N1%BLOCK_SIZE_y == 0 ? 0 : 1));
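
For completeness, the kernel is then invoked as follows (a minimal sketch; d_data is just an illustrative name for the device array of N1*N2 double2 elements):

fftshift_2D<<<dimGrid2, dimBlock2>>>(d_data, N1, N2);
cudaDeviceSynchronize(); // make sure the profiler sees the completed kernel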

N1 and N2 can be arbitrary even numbers.

The card is an NVIDIA GT 540M.

Comments:

More info needed. N1=? N2=? blockDim=? gridDim=? j*blockDim.x*gridDim.x+i looks odd. Could it be something like j*N1+i? (kangshiyin)

@talonmies I have added additional information in my revised post. (Vitality)

@EricShiyinKang I have added additional information in my revised post. I believe that the data indexing is fine. (Vitality)

@JackOLantern: Is this a Fermi or Kepler GPU? (talonmies)

@talonmies Provided in the revised post. (Vitality)

2 Answers

5 votes

Take a look at this NVIDIA blog post about the efficiency of various memory access patterns. You're hitting the strided memory access problem.

Since each component is used independently, you could treat your double2 array as a plain double array instead (just as Robert Crovella suggested).

__global__ void fftshift_2D(double *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    // i now indexes individual doubles, i.e. 2*N1 of them per row
    if (i < N1 * 2 && j < N2) {
        // consecutive pairs (real, imaginary) get the same sign
        double a = pow(-1.0, (i / 2 + j) & 1);
        data[j * blockDim.x * gridDim.x + i] *= a;
    }
}
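
Note that this version needs the grid to span 2 * N1 threads along x instead of N1, so the launch configuration has to change accordingly. A minimal sketch, reusing the question's BLOCK_SIZE_x and BLOCK_SIZE_y (the question's grid maps x to N2; here I assume x spans the row of 2 * N1 doubles, to match the guard above):

dim3 dimBlock(BLOCK_SIZE_x, BLOCK_SIZE_y);
dim3 dimGrid((2 * N1 + BLOCK_SIZE_x - 1) / BLOCK_SIZE_x,
             (N2 + BLOCK_SIZE_y - 1) / BLOCK_SIZE_y);
fftshift_2D<<<dimGrid, dimBlock>>>((double *)data, N1, N2);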

But if you ever need to access both the x and y components in a single thread, you could try:

Using two separate arrays, one with the x component and one with the y component (see below for one way to split an existing double2 array), like this:

__global__ void fftshift_2D(double *dataX, double *dataY, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i+j)&1);

        // each array is accessed with unit stride, so both are coalesced
        dataX[j*blockDim.x*gridDim.x+i] *= a;
        dataY[j*blockDim.x*gridDim.x+i] *= a;
    }
}
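
If the data already exists in interleaved double2 form, one way to split it is a pair of strided cudaMemcpy2D calls (a sketch; dataX and dataY are assumed to be separate device allocations of N1 * N2 doubles each):

size_t n = (size_t)N1 * N2;
// copy the .x components: one 8-byte "row" per element, 16-byte source pitch
cudaMemcpy2D(dataX, sizeof(double), data, sizeof(double2),
             sizeof(double), n, cudaMemcpyDeviceToDevice);
// copy the .y components: same, but starting 8 bytes into the source
cudaMemcpy2D(dataY, sizeof(double), (char *)data + sizeof(double), sizeof(double2),
             sizeof(double), n, cudaMemcpyDeviceToDevice);

Such an element-by-element copy is not fast, so it only pays off if the split arrays are reused many times; producing the data in the two-array layout in the first place is cleaner.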

Or leaving the data layout as is, but loading it without a stride into shared memory and reshuffling it there. That would look more or less like this:

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    // one slice of 2*BLOCK_SIZE doubles per row of threads in the block,
    // so rows with different threadIdx.y do not overwrite each other
    // (assumes a square BLOCK_SIZE x BLOCK_SIZE block, as in the question)
    __shared__ double buff[BLOCK_SIZE][BLOCK_SIZE * 2];
    double2 *buff2 = (double2 *) buff[threadIdx.y];
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    double *ptr = (double *) &data[j * blockDim.x * gridDim.x + blockDim.x * blockIdx.x];

    // TODO add guarding with N1 & N2
    // coalesced loads: consecutive threads read consecutive doubles
    buff[threadIdx.y][threadIdx.x] = ptr[threadIdx.x];
    buff[threadIdx.y][blockDim.x + threadIdx.x] = ptr[blockDim.x + threadIdx.x];
    __syncthreads();

    // reshuffled access: each thread now owns one double2 in shared memory
    double a = pow(-1.0, (i + j) & 1);
    buff2[threadIdx.x].x *= a;
    buff2[threadIdx.x].y *= a;

    __syncthreads();
    // coalesced stores back to global memory
    ptr[threadIdx.x] = buff[threadIdx.y][threadIdx.x];
    ptr[blockDim.x + threadIdx.x] = buff[threadIdx.y][blockDim.x + threadIdx.x];
}
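
Note the two-dimensional shared buffer: with a 16x16 block, each row of threads (each threadIdx.y) gets its own slice of 2*BLOCK_SIZE doubles, so rows do not overwrite each other, while the global loads and stores stay coalesced because consecutive threads touch consecutive doubles.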

4 votes

Yes, since you have an array-of-structures (AoS) data storage format, and you are referencing only every other element with the following line:

    data[j*blockDim.x*gridDim.x+i].x *= a;

then the global load and the global store that occur as a result will each have only 50% utilization. Note that I think the cache should help here, since you're referencing the alternate elements on the following line, but the load/store efficiency is still 50%.

I believe you could work around this (for this particular example) by recasting *data:

double *mydata = (double *)data;  // reinterpret the double2 array as a flat double array
...
mydata[2*(j*blockDim.x*gridDim.x)+i] *= a;

Please note that I'm not trying to show exactly how to get the same coverage, just to illustrate the idea. The code above is approximately what is needed, but you will have to tweak it to make sure all the elements you want multiplied are handled correctly.
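
For what it's worth, here is a tweaked sketch of that idea, assuming the launch is changed so that the grid's x dimension covers 2 * N1 threads per row (the kernel name and the coverage choice are mine, not a tested implementation):

__global__ void fftshift_2D_recast(double *mydata, int N1, int N2)
{
    // i indexes individual doubles, so it runs over 2*N1 per row
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < 2 * N1 && j < N2) {
        // the real and imaginary halves of one element share the same sign
        double a = pow(-1.0, ((i / 2) + j) & 1);
        // row pitch kept as blockDim.x*gridDim.x, matching the question's indexing
        mydata[j * blockDim.x * gridDim.x + i] *= a;
    }
}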