I have the following kernel to get the magnitude of a bunch of vectors:
__global__ void norm_v1(double *in, double *out, int n)
{
const uint i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
{
double x = in[3*i], y = in[3*i+1], z = in[3*i+2];
out[i] = sqrt(x*x + y*y + z*z);
}
}
However due to the packing of in as [x0,y0,z0,...,xn,yn,zn] it performs poorly with the profiler indicating a 32% global load efficiency. Repacking the data as [x0, x1, ..., xn, y0, y1, ..., yn, z0, z1, ..., zn] improves things greatly (with the offsets for x, y, and z changing accordingly). Runtime is down and efficiency is up to 100%.
However, this packing is simply not practical for my application. I therefore wish to investigate the use of shared memory. My idea is for each thread in a block to copy three values (blockDim.x apart) from global memory -- yielding coalesced access. Under the assumption of a maximum blockDim.x = 256 I came up with:
#define BLOCKDIM 256
__global__ void norm_v2(double *in, double *out, int n)
{
__shared__ double invec[3*BLOCKDIM];
const uint i = blockIdx.x * blockDim.x + threadIdx.x;
invec[0*BLOCKDIM + threadIdx.x] = in[0*BLOCKDIM+i];
invec[1*BLOCKDIM + threadIdx.x] = in[1*BLOCKDIM+i];
invec[2*BLOCKDIM + threadIdx.x] = in[2*BLOCKDIM+i];
__syncthreads();
if (i < n)
{
double x = invec[3*threadIdx.x];
double y = invec[3*threadIdx.x+1];
double z = invec[3*threadIdx.x+2];
out[i] = sqrt(x*x + y*y + z*z);
}
}
However this is clearly deficient when n % blockDim.x != 0, requires knowing the maximum blockDim in advance and generates incorrect results for out[i > 255] when tested with an n = 1024. How should I best remedy this?