I wrote a kernel that computes the (squared) Euclidean distances between a given D-dimensional vector q
(stored in constant memory) and an array pts
of N vectors (also D-dimensional).
The array is laid out in structure-of-arrays fashion: the first N elements are the first coordinates of all N vectors, followed by the N second coordinates, and so on.
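In other words, coordinate j of vector i lives at pts[j * N + i]. As a minimal sketch (the host buffer host_pts and the source array vectors are hypothetical names), the buffer would be filled like this:

// Hypothetical host-side fill of the structure-of-arrays buffer:
// coordinate j of vector i goes to host_pts[j * N + i].
for (int i = 0; i < N; ++i)
    for (int j = 0; j < D; ++j)
        host_pts[j * N + i] = vectors[i][j];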
Here is the kernel:
__constant__ float q[20];

__global__ void compute_dists(float *pts, float *dst, int n, int d) {
    // Grid-stride loop: each thread handles points i, i + gridDim.x * blockDim.x, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float ld = 0;
        for (int j = 0; j < d; ++j) {
            float tmp = q[j] - pts[j * n + i];
            ld += tmp * tmp;
        }
        dst[i] = ld;
    }
}
It is invoked as follows:
const int N = 1000000, D = 20;
compute_dists<<<32, 512>>>(vecs, dists, N, D);
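Before the launch, the query vector is copied into constant memory and the device buffers are set up; a sketch of what I assume is in place (the host pointers host_q and host_pts are hypothetical):

// Copy the query vector into the constant-memory symbol q.
cudaMemcpyToSymbol(q, host_q, D * sizeof(float));

// Device buffers for the points and the resulting distances.
float *vecs, *dists;
cudaMalloc(&vecs, (size_t)N * D * sizeof(float));
cudaMalloc(&dists, (size_t)N * sizeof(float));
cudaMemcpy(vecs, host_pts, (size_t)N * D * sizeof(float), cudaMemcpyHostToDevice);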
Now, profiling this kernel with the NVIDIA Visual Profiler on a Quadro K1000M results in warnings about
- high instruction replay overhead (31.2%) and
- high global memory instruction overhead (31.2%).
This is very surprising to me because, as far as I can tell, the memory accesses are coalesced: j * n + i is always a multiple of 32 for the first thread in a warp (since both n and blockDim.x are multiples of 32), which gives 128-byte alignment, and there is no branch divergence.
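To illustrate what I mean, here is a small host-side check (the values j = 3 and warp_start = 64 are just example numbers) that prints the byte offsets one warp touches for a fixed coordinate j:

#include <cstdio>

int main() {
    const int n = 1000000;             // multiple of 32
    const int j = 3, warp_start = 64;  // example coordinate and first lane's i
    for (int lane = 0; lane < 32; ++lane) {
        long byte_off = (long)(j * n + warp_start + lane) * sizeof(float);
        printf("lane %2d -> byte offset %ld\n", lane, byte_off);
    }
    // The first offset, (3*1000000 + 64) * 4 = 12000256, is divisible by 128,
    // and the 32 offsets are contiguous, so the warp load looks fully coalesced.
    return 0;
}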
Are there other factors that contribute to the instruction replay overhead metric, or am I missing something else?