I wrote a kernel that computes the (squared) Euclidean distances between a given D-dimensional vector q
(stored in constant memory) and an array pts
of N vectors (also D-dimensional).
The array is laid out in structure-of-arrays fashion: the first N elements are the first coordinates of all N vectors, followed by the N second coordinates, and so on.
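In other words, coordinate j of vector i lives at pts[j * N + i]. As a minimal sketch (the host buffer host_pts and the source array vectors are hypothetical names), the buffer would be filled like this:

// Hypothetical host-side fill of the structure-of-arrays buffer:
// coordinate j of vector i goes to host_pts[j * N + i].
for (int i = 0; i < N; ++i)
    for (int j = 0; j < D; ++j)
        host_pts[j * N + i] = vectors[i][j];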
Here is the kernel:
__constant__ float q[20];

__global__ void compute_dists(float *pts, float *dst, int n, int d) {
    // Grid-stride loop: each thread handles points i, i + gridDim.x * blockDim.x, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float ld = 0;
        for (int j = 0; j < d; ++j) {
            float tmp = q[j] - pts[j * n + i];
            ld += tmp * tmp;
        }
        dst[i] = ld;
    }
}
It is invoked as follows:
const int N = 1000000, D = 20;
compute_dists<<<32, 512>>>(vecs, dists, N, D);
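Before the launch, the query vector is copied into constant memory and the device buffers are set up; a sketch of what I assume is in place (the host pointers host_q and host_pts are hypothetical):

// Copy the query vector into the constant-memory symbol q.
cudaMemcpyToSymbol(q, host_q, D * sizeof(float));

// Device buffers for the points and the resulting distances.
float *vecs, *dists;
cudaMalloc(&vecs, (size_t)N * D * sizeof(float));
cudaMalloc(&dists, (size_t)N * sizeof(float));
cudaMemcpy(vecs, host_pts, (size_t)N * D * sizeof(float), cudaMemcpyHostToDevice);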
Now, profiling this kernel with the NVIDIA Visual Profiler on a Quadro K1000M results in warnings about
- high instruction replay overhead (31.2%) and
- high global memory instruction overhead (31.2%).
This is very surprising to me because, as far as I can tell, the memory accesses are coalesced: j * n + i is always a multiple of 32 for the first thread in a warp (since both n and blockDim.x are multiples of 32), which gives 128-byte alignment, and there is no branch divergence.
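To illustrate what I mean, here is a small host-side check (the values j = 3 and warp_start = 64 are just example numbers) that prints the byte offsets one warp touches for a fixed coordinate j:

#include <cstdio>

int main() {
    const int n = 1000000;             // multiple of 32
    const int j = 3, warp_start = 64;  // example coordinate and first lane's i
    for (int lane = 0; lane < 32; ++lane) {
        long byte_off = (long)(j * n + warp_start + lane) * sizeof(float);
        printf("lane %2d -> byte offset %ld\n", lane, byte_off);
    }
    // The first offset, (3*1000000 + 64) * 4 = 12000256, is divisible by 128,
    // and the 32 offsets are contiguous, so the warp load looks fully coalesced.
    return 0;
}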
Are there other factors that contribute to the instruction replay overhead metric, or am I missing something else?