When should we use prefetch?

Question

Some CPU and compilers supply prefetch instructions. Eg: __builtin_prefetch in GCC Document. Although there is a comment in GCC's document, but it's too short to me.

I want to know, in prantice, when should we use prefetch? Are there some examples? Thx!

It's hard to get a boost from manual prefetching due to the existence of hardware prefetch. But here's an example of where it works: stackoverflow.com/questions/7327994/prefetching-examples — Mysticial
The first comment of the first answer there is genuinely reassuring: "no performance difference". :) — oakad
@oakad Oh yeah, I have tested the program on a Xeon 24 core CPU, enable prefix use 1.31 seconds and disable prefix use 1.32 seconds, no noticeable difference — 大宝剑
@oakad, KaiWen, that comment in the linked question was wrong, deducing from a single case that prefetches aren't useful in general is bad logic. I think my example below would have worked for him as well. — Leeor
For a full answer on when prefetching is useful and associated tradeoffs, see my survey paper on cache prefetching techniques. — user984260

Leeor Leeor · Accepted Answer · 2013-12-24T09:57:39

This question isn't really about compilers as they're just providing some hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats but you can just ignore all these and (carefully) add it directly in assembly code.

Now the real question seems to be "when are prefetches useful", and the answer is - in any scenario where youre bounded on memory latency, and the access pattern isn't regular and distinguishable for the HW prefetch to capture (organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.
Most compilers would only very seldom insert their own prefetches for you, so it's basically up to you to play with your code and benchmark how prefetches could be useful.

The link by @Mysticial shows a nice example, but here's a more straight forward one that I think can't be caught by HW:

#include "stdio.h"
#include "sys/timeb.h"
#include "emmintrin.h"

#define N 4096
#define REP 200
#define ELEM int

int main() {
    int i,j, k, b;
    const int blksize = 64 / sizeof(ELEM);
    ELEM __attribute ((aligned(4096))) a[N][N];
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            a[i][j] = 1;
        }
    }
    unsigned long long int sum = 0;
    struct timeb start, end;
    unsigned long long delta;

    ftime(&start);
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j ++) {
                sum += a[i][j];
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf ("Prefetching off: N=%d, sum=%lld, time=%lld\n", N, sum, delta); 

    ftime(&start);
    sum = 0;
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j += blksize) {
                for (b = 0; b < blksize; ++b) {
                    sum += a[i][j+b];
                }
                _mm_prefetch(&a[i+1][j], _MM_HINT_T2);
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf ("Prefetching on:  N=%d, sum=%lld, time=%lld\n", N, sum, delta); 
}

What I do here is traverse each matrix line (enjoying the HW prefetcher help with the consecutive lines), but prefetch ahead the element with the same column index from the next line that resides in a different page (which the HW prefetch should be hard pressed to catch). I sum the data just so that it's not optimized away, the important thing is that I basically just loop over a matrix, should have been pretty straightforward and simple to detect, and yet still get a speedup.

Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:

Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on:  N=4096, sum=3355443200, time=1502

Note that the speedup is received even though I made the control flow more complicated (extra loop nesting level), the branch predictor should easily catch the pattern of that short block-size loop, and it saves execution of unneeded prefetches.

Note that Ivybridge and onward on should have a "next-page prefetcher", so the HW may be able to mitigate that on these CPUs (if anyone has one available and cares to try i'll be happy to know). In that case i'd modify the benchmark to sum every second line (and the prefetch would look ahead two lines everytime), that should confuse the hell out of the HW prefetchers.

Skylake results

Here are some results from a Skylake i7-6700-HQ, running at 2.6 GHz (no turbo) with gcc:

Compile flags: -O3 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on:  N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on:  N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on:  N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on:  N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on:  N=4096, sum=28147495993344000, time=1253

Compile flags: -O2 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on:  N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on:  N=4096, sum=28147495993344000, time=1814

So using prefetch is either about 40% slower, or 8% faster depending on if you use -O3 or -O2 respectively for this particular example. The big slowdown for -O3 is actually due to a code generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch variant loop prevents vectorization on my version of gcc anyway.

So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. Still it's worth noting that you have to be careful not to change code generation such that you get a big slowdown.

This test probably isn't ideal in that by going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (that's why vectorization helped so much).

When should we use prefetch?

6 Answers

Investigation

Fitting more loads in the out-of-order window

In-order machines

Working around hardware prefetch restrictions

Counterpoint

Special load hints