9
votes

I am trying to vectorize a loop, computing dot product of a large float vectors. I am computing it in parallel, utilizing the fact that CPU has large amount of XMM registers, like this:

__m128 *A, *B;
__m128 dot0 = _mm_setzero_ps(), dot1 = _mm_setzero_ps(),
       dot2 = _mm_setzero_ps(), dot3 = _mm_setzero_ps();
for(size_t i=0; i<1048576; i+=4) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0] ) );
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1] ) );
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2] ) );
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3] ) );
}
... // add dots, then shuffle/hadd result.

I heard that using prefetch instructions could help speed things up, as they could fetch further data "in the background" while doing muls and adds on data that is already in cache. However I failed to find examples and explanations of how to use _mm_prefetch(): when, with what addresses, and with what hints. Could you assist with this?

1
Usually hardware prefetch does a good job for sequential access, and you don't need software prefetch. See also What Every Programmer Should Know About Memory? – Peter Cordes
You might consider using 8 accumulators, though, instead of only 4, to better hide the latency of addps (especially on Skylake where it's 4 cycle latency). x86-64 has 16 xmm registers (or ymm with AVX). And especially if you're compiling with FMA so those mul/add operations can collapse into a single FMA. More details on another SSE/AVX dot-product question. This might not help if you're mostly memory bottlenecked, though. – Peter Cordes
You probably want to use prefetchnta if A and B are large and won't be read again soon. You want to prefetch once per 64B cache line, and you'll need to tune how far ahead to prefetch. e.g. _mm_prefetch((char*)(A+64), _MM_HINT_NTA); and the same for B would prefetch 16*64 = 1024 bytes ahead of where you're loading, allowing for hiding some of the latency of a cache miss but still easily fitting in L1D. Read Ulrich Drepper's "What Every Programmer Should Know About Memory?" (the PDF in my earlier link) for more about caching and prefetch. – Peter Cordes
@PeterCordes My experiments on Skylake X have convinced me not to recommend prefetchnta anymore unless you really really know what you're doing. And I emphasize the "really" part because even I cannot consider myself competent enough to use it properly without it backfiring on Skylake X. – Mysticial
If you prefetchnta and it gets evicted before it is used (either because it's too early or because the other hyperthread is misbehaving), it gets evicted from all levels of cache and then refetched from memory on the actual read. The result is 2x the bandwidth cost and a massive slowdown on anything that's bandwidth-bound. At least prefetcht0 seems to get evicted into the higher-level caches. – Mysticial

1 Answer

20
votes

The short answer for perfectly linear streaming loops like yours is probably: don't use software prefetch at all; let the hardware prefetchers do the work.

Still, it's possible that you can speed things up with software prefetching, and here is the theory and some detail if you want to try...

Basically you call _mm_prefetch() on an address you'll need at some point in the future. It is similar in some respects to loading a value from memory and doing nothing with it: both bring the line into the L1 cache2, but the prefetch intrinsic, which under the covers emits a dedicated prefetch instruction, has some advantages that make it suitable for prefetching.

It works at cache-line granularity1: you only need to issue one prefetch per cache line; anything more is just a waste. That means that in general, you should try to unroll your loop enough that you issue only one prefetch per cache line. In the case of 16-byte __m128 values, that means unrolling at least by 4 (which you've done, so you are good there).

Then simply prefetch each of your access streams some PF_DIST distance ahead of the current calculation, something like:

for(size_t i=0; i<1048576; i+=4) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0] ) );
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1] ) );
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2] ) );
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3] ) );
    _mm_prefetch((const char*)(A + i + PF_A_DIST), HINT_A);
    _mm_prefetch((const char*)(B + i + PF_B_DIST), HINT_B);
}

Here PF_[A|B]_DIST is the distance to prefetch ahead of the current iteration and HINT_ is the temporal hint to use. Rather than try to calculate the right distance value from first principles, I would simply determine good values of PF_[A|B]_DIST experimentally4. To reduce the search space, you can start by setting them both equal, since logically a similar distance is likely to be ideal. You might find that only prefetching one of the two streams is ideal.
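A crude tuning harness might look like the sketch below. The function names (`dot`, `tune_pf_dist`) and the candidate distances are my own choices, and `clock()` is only a rough timer; a real experiment should use large arrays, many repetitions, and a better clock:

```c
#include <time.h>
#include <xmmintrin.h>

/* Prefetching dot product; pf_dist is in units of __m128 (16 bytes). */
static float dot(const __m128 *A, const __m128 *B, size_t n, size_t pf_dist)
{
    __m128 d0 = _mm_setzero_ps(), d1 = _mm_setzero_ps(),
           d2 = _mm_setzero_ps(), d3 = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        d0 = _mm_add_ps(d0, _mm_mul_ps(A[i+0], B[i+0]));
        d1 = _mm_add_ps(d1, _mm_mul_ps(A[i+1], B[i+1]));
        d2 = _mm_add_ps(d2, _mm_mul_ps(A[i+2], B[i+2]));
        d3 = _mm_add_ps(d3, _mm_mul_ps(A[i+3], B[i+3]));
        /* Prefetch past the end is harmless: prefetch never faults. */
        _mm_prefetch((const char *)(A + i + pf_dist), _MM_HINT_T0);
        _mm_prefetch((const char *)(B + i + pf_dist), _MM_HINT_T0);
    }
    __m128 s = _mm_add_ps(_mm_add_ps(d0, d1), _mm_add_ps(d2, d3));
    float out[4];
    _mm_storeu_ps(out, s);
    return out[0] + out[1] + out[2] + out[3];
}

/* Time each candidate distance and return the fastest one. */
static size_t tune_pf_dist(const __m128 *A, const __m128 *B, size_t n)
{
    static const size_t cand[] = { 4, 8, 16, 32, 64, 128 };
    size_t best = cand[0];
    double best_t = 1e30;
    for (size_t c = 0; c < sizeof cand / sizeof cand[0]; c++) {
        clock_t t0 = clock();
        volatile float sink = dot(A, B, n, cand[c]);  /* keep the call alive */
        (void)sink;
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = cand[c]; }
    }
    return best;
}
```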

It is very important to note that the ideal PF_DIST depends on the hardware configuration. Not just on the CPU model, but also on the memory configuration, including details such as the snooping mode for multi-socket systems. For example, the best value could be wildly different on client and server chips of the same CPU family. So you should run your tuning experiment on the actual hardware you are targeting, as much as possible. If you target a variety of hardware, you can test on all of it and hopefully find a value that's good everywhere, or even consider compile-time or runtime dispatching depending on CPU type (not always enough, as above) or based on a runtime test. Now just relying on hardware prefetching is starting to sound a lot better, isn't it?

You can use the same approach to find the best HINT since the search space is small (only 4 values to try) - but here you should be aware that the difference between the hints (particularly _MM_HINT_NTA) might only show up as a performance difference in code that runs after this loop, since the hints affect how much data unrelated to this kernel remains in the cache.
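For reference, the four hint values look like this (the comments reflect typical behavior; the exact cache levels involved are microarchitecture-dependent):

```c
#include <xmmintrin.h>

/* The four temporal-locality hints accepted by _mm_prefetch.
 * The hint argument must be a compile-time constant. */
static void prefetch_examples(const void *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_T0);  /* into all cache levels        */
    _mm_prefetch((const char *)p, _MM_HINT_T1);  /* typically L2 and outward     */
    _mm_prefetch((const char *)p, _MM_HINT_T2);  /* typically L3 and outward     */
    _mm_prefetch((const char *)p, _MM_HINT_NTA); /* non-temporal: minimize
                                                    cache pollution              */
}
```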

You might also find that this prefetching doesn't help at all, since your access patterns are perfectly linear and likely to be handled well by the L2 stream prefetchers. Still, there are some additional, more hardcore things you could try or consider:

  • You might investigate whether prefetching only near the start of 4K page boundaries helps3. This will complicate your loop structure: you'll probably need a nested loop to separate the "near edge of page" and "deep inside the page" cases in order to issue prefetches only near page boundaries. You'll also want to make your input arrays page-aligned, or else it gets even more complicated.
  • You can try disabling some/all of the hardware prefetchers. This is usually terrible for overall performance, but on a highly tuned load with software prefetching, you might see better performance by eliminating interference from hardware prefetching. Selectively disabling prefetching also gives you a key tool for understanding what's going on, even if you ultimately leave all the prefetchers enabled.
  • Make sure you are using huge pages, since they are ideal for large contiguous blocks like this.
  • There are problems with prefetching at the beginning and end of your main calculation loop: at the start, you'll miss prefetching the data at the beginning of each array (within the initial PF_DIST window), and at the end, you'll prefetch additional data up to PF_DIST beyond the end of your array. At best these waste fetch and instruction bandwidth, but they may also cause (ultimately discarded) page walks which may affect performance. You can fix both with special intro and outro loops that handle these cases.
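The intro/outro idea from the last bullet can be sketched as follows. This is my own illustration, not a tuned implementation: PF_DIST = 16 is an assumed tuning result, and n is assumed to be a multiple of 4:

```c
#include <xmmintrin.h>

#define PF_DIST 16  /* assumed tuning result, in __m128 units (16 bytes each) */

/* Split the loop so prefetches never run past the end of the arrays:
 * the main loop prefetches PF_DIST ahead; the outro loop (last PF_DIST
 * elements) does the same math with no prefetches. */
static float dot_split(const __m128 *A, const __m128 *B, size_t n)
{
    __m128 d0 = _mm_setzero_ps(), d1 = _mm_setzero_ps(),
           d2 = _mm_setzero_ps(), d3 = _mm_setzero_ps();
    size_t main_n = n > PF_DIST ? n - PF_DIST : 0;
    size_t i = 0;
    for (; i < main_n; i += 4) {
        d0 = _mm_add_ps(d0, _mm_mul_ps(A[i+0], B[i+0]));
        d1 = _mm_add_ps(d1, _mm_mul_ps(A[i+1], B[i+1]));
        d2 = _mm_add_ps(d2, _mm_mul_ps(A[i+2], B[i+2]));
        d3 = _mm_add_ps(d3, _mm_mul_ps(A[i+3], B[i+3]));
        _mm_prefetch((const char *)(A + i + PF_DIST), _MM_HINT_T0);
        _mm_prefetch((const char *)(B + i + PF_DIST), _MM_HINT_T0);
    }
    for (; i < n; i += 4) {  /* outro: no prefetches past the end */
        d0 = _mm_add_ps(d0, _mm_mul_ps(A[i+0], B[i+0]));
        d1 = _mm_add_ps(d1, _mm_mul_ps(A[i+1], B[i+1]));
        d2 = _mm_add_ps(d2, _mm_mul_ps(A[i+2], B[i+2]));
        d3 = _mm_add_ps(d3, _mm_mul_ps(A[i+3], B[i+3]));
    }
    __m128 s = _mm_add_ps(_mm_add_ps(d0, d1), _mm_add_ps(d2, d3));
    float out[4];
    _mm_storeu_ps(out, s);
    return out[0] + out[1] + out[2] + out[3];
}
```

An intro loop that prefetches the first PF_DIST window before the main loop starts could be added symmetrically.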

I also highly recommend the 5-part blog post Optimizing AMD Opteron Memory Bandwidth, which describes optimizing a problem very similar to yours, and which covers prefetching in some detail (it gave a large boost). Now this is totally different hardware (AMD Opteron) which likely behaves differently from more recent hardware (and especially from Intel hardware, if that's what you're using) - but the process of improvement is the key takeaway, and the author is an expert in the field.


1 It may actually work at something like 2-cache-line granularity depending on how it interacts with the adjacent cache line prefetcher(s). In this case, you may be able to get away with issuing half the number of prefetches: one every 128 bytes.

2 In the case of software prefetch, you can also select some other level of cache, using the temporal hint.

3 There is some indication that even with perfect streaming loads, and despite the presence of "next page prefetchers" in modern Intel hardware, page boundaries are still a barrier to hardware prefetching that can be partially alleviated by software prefetching. Maybe because software prefetch serves as a stronger hint that "Yes, I'm going to read into this page", or because software prefetch works at the virtual address level and necessarily involves the translation machinery, while L2 prefetching works at the physical level.

4 Note that the "unit" of the PF_DIST value is sizeof(__m128), i.e., 16 bytes, due to the way I calculated the address.