I'm profiling the following CUDA kernel
__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i+j)&1);            // a = (-1)^(i+j), i.e. +1 or -1
        data[j*blockDim.x*gridDim.x+i].x *= a;
        data[j*blockDim.x*gridDim.x+i].y *= a;
    }
}
which multiplies each element of a 2D double-precision complex data matrix by a double-precision scalar, namely a = (-1)^(i+j).
As can be seen, I'm performing coalesced global memory accesses, and I wanted to verify this with the NVIDIA Visual Profiler by inspecting the global memory load and store efficiencies. Surprisingly, both efficiencies turn out to be exactly 50%, far from the expected 100% for coalesced memory access. Is this related to the interleaved storage of the real and imaginary parts of the complex numbers? If so, is there any trick I could exploit to restore a 100% efficiency?
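For example, one variant I have in mind (an untested sketch, only to make the question concrete; it assumes the complex element at row j, column i is stored contiguously at j*N1+i, whereas my kernel above uses blockDim.x*gridDim.x as the pitch) views the buffer as plain doubles, so that each thread loads and stores a single 8-byte word:

// Untested sketch: view the double2 buffer as a flat double array so that each
// thread touches one 8-byte word. Assumes row-major storage with N1 as the
// leading dimension, i.e. the complex element (i, j) lives at j*N1 + i.
__global__ void fftshift_2D_flat(double *data, int N1, int N2)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid < 2 * N1 * N2) {
        int elem = tid / 2;                       // index of the complex element
        int i = elem % N1;                        // column index
        int j = elem / N1;                        // row index
        double a = 1.0 - 2.0 * ((i + j) & 1);     // (-1)^(i+j) without calling pow()
        data[tid] *= a;                           // real and imaginary parts get the same sign
    }
}

This would be launched with a 1D grid covering 2*N1*N2 threads (e.g. blocks of 256). I have not profiled this variant yet, so I don't know whether it would actually report 100% efficiency.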
Thank you in advance.
ADDITIONAL INFORMATION
BLOCK_SIZE_x=16
BLOCK_SIZE_y=16
dim3 dimBlock2(BLOCK_SIZE_x,BLOCK_SIZE_y);
dim3 dimGrid2(N2/BLOCK_SIZE_x + (N2%BLOCK_SIZE_x == 0 ? 0:1),N1/BLOCK_SIZE_y + (N1%BLOCK_SIZE_y == 0 ? 0:1));
N1 and N2 can be arbitrary even numbers.
The card is an NVIDIA GT 540M.
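In case it helps to reproduce the measurement, this is roughly how I allocate and launch the kernel above (N1 = N2 = 1024 is only an example size, and the initial contents of the buffer do not matter for the profile):

// Rough reproduction sketch; N1 = N2 = 1024 is only an example size.
#include <cuda_runtime.h>
#include <cstdio>

#define BLOCK_SIZE_x 16
#define BLOCK_SIZE_y 16

int main()
{
    const int N1 = 1024, N2 = 1024;                       // arbitrary even sizes
    double2 *d_data;
    cudaMalloc(&d_data, N1 * N2 * sizeof(double2));
    cudaMemset(d_data, 0, N1 * N2 * sizeof(double2));     // contents are irrelevant here

    dim3 dimBlock2(BLOCK_SIZE_x, BLOCK_SIZE_y);
    dim3 dimGrid2(N2 / BLOCK_SIZE_x + (N2 % BLOCK_SIZE_x == 0 ? 0 : 1),
                  N1 / BLOCK_SIZE_y + (N1 % BLOCK_SIZE_y == 0 ? 0 : 1));

    fftshift_2D<<<dimGrid2, dimBlock2>>>(d_data, N1, N2);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_data);
    return 0;
}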
j*blockDim.x*gridDim.x+i looks odd. Could it be something like j*N1+i? – kangshiyin