The guide in PDF form is at https://www.akkadia.org/drepper/cpumemory.pdf.
It is still generally excellent and highly recommended (by me, and I think by other performance-tuning experts). It would be cool if Ulrich (or anyone else) wrote a 2017 update, but that would be a lot of work (e.g. re-running the benchmarks). See also other x86 performance-tuning and SSE/asm (and C/C++) optimization links in the x86 tag wiki. (Ulrich's article isn't x86 specific, but most (all) of his benchmarks are on x86 hardware.)
The low-level hardware details about how DRAM and caches work all still apply. DDR4 uses the same commands as described for DDR1/DDR2 (read/write burst). The DDR3/4 improvements aren't fundamental changes. AFAIK, all the arch-independent stuff still applies generally, e.g. to AArch64 / ARM32.
See also the Latency Bound Platforms section of this answer for important details about the effect of memory/L3 latency on single-threaded bandwidth: `bandwidth <= max_concurrency / latency`, and this is actually the primary bottleneck for single-threaded bandwidth on a modern many-core CPU like a Xeon. But a quad-core Skylake desktop can come close to maxing out DRAM bandwidth with a single thread. That link has some very good info about NT stores vs. normal stores on x86. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? is a summary.
Thus Ulrich's suggestion in 6.5.8 Utilizing All Bandwidth about using remote memory on other NUMA nodes as well as your own is counter-productive on modern hardware, where memory controllers have more bandwidth than a single core can use. You can possibly imagine a situation where there's a net benefit to running multiple memory-hungry threads on the same NUMA node for low-latency inter-thread communication, while having them use remote memory for high-bandwidth, non-latency-sensitive work. But this is pretty obscure; normally just divide threads between NUMA nodes and have them use local memory. Per-core bandwidth is sensitive to latency because of max-concurrency limits (see below), but all the cores in one socket can usually more than saturate the memory controllers in that socket.
(usually) Don't use software prefetch
One major thing that's changed is that hardware prefetch is much better than on the Pentium 4 and can recognize strided access patterns up to a fairly large stride, and multiple streams at once (e.g. one forward / backward per 4k page). Intel's optimization manual describes some details of the HW prefetchers in various levels of cache for their Sandybridge-family microarchitecture. Ivybridge and later have next-page hardware prefetch, instead of waiting for a cache miss in the new page to trigger a fast-start. I assume AMD has some similar stuff in their optimization manual. Beware that Intel's manual is also full of old advice, some of which is only good for P4. The Sandybridge-specific sections are of course accurate for SnB, but e.g. un-lamination of micro-fused uops changed in HSW and the manual doesn't mention it.
The usual advice these days is to remove all SW prefetch from old code, and only consider putting it back in if profiling shows cache misses (and you're not saturating memory bandwidth). Prefetching both sides of the next step of a binary search can still help. e.g. once you decide which element to look at next, prefetch the 1/4 and 3/4 elements so they can load in parallel with loading/checking middle.
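A minimal sketch of that binary-search idea (the function name is mine, and whether it actually pays off depends on the array being much larger than cache; benchmark before trusting it):

```cpp
#include <cstddef>

// Lower-bound binary search that prefetches both possible next midpoints
// while the current middle element is still being loaded and compared.
// __builtin_prefetch never faults, so prefetching a one-past-the-end
// address in the final iterations is harmless.
size_t lower_bound_prefetch(const int* a, size_t n, int key) {
    size_t lo = 0, hi = n;                              // search [lo, hi)
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        __builtin_prefetch(a + lo + (mid - lo) / 2);            // next mid if we go left
        __builtin_prefetch(a + mid + 1 + (hi - mid - 1) / 2);   // next mid if we go right
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                                          // first index with a[i] >= key
}
```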
The suggestion to use a separate prefetch thread (6.3.4) is totally obsolete, I think, and was only ever good on Pentium 4. P4 had hyperthreading (2 logical cores sharing one physical core), but not enough trace-cache (and/or out-of-order execution resources) to gain throughput running two full computation threads on the same core. But modern CPUs (Sandybridge-family and Ryzen) are much beefier and should either run a real thread or not use hyperthreading (leave the other logical core idle so the solo thread has the full resources instead of partitioning the ROB).
Software prefetch has always been "brittle": the right magic tuning numbers to get a speedup depend on the details of the hardware, and maybe system load. Too early and it's evicted before the demand load. Too late and it doesn't help. This blog article shows code + graphs for an interesting experiment in using SW prefetch on Haswell for prefetching the non-sequential part of a problem. See also How to properly use prefetch instructions?. NT prefetch is interesting, but even more brittle because an early eviction from L1 means you have to go all the way to L3 or DRAM, not just L2. If you need every last drop of performance, and you can tune for a specific machine, SW prefetch is worth looking at for sequential access, but it may still be a slowdown if you have enough ALU work to do while coming close to bottlenecking on memory.
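For sequential access, the "magic tuning number" is the prefetch distance. Something like this sketch (the 16-lines-ahead distance is a made-up starting point, not a recommendation; it is exactly the brittle knob discussed above):

```cpp
#include <cstddef>

// Sum an array while software-prefetching PF_DIST elements ahead.
// Too small a distance and the line isn't ready when the demand load
// arrives; too large and it may be evicted again before use.
long long sum_with_prefetch(const int* p, size_t n) {
    const size_t PF_DIST = 16 * 64 / sizeof(int);   // 16 cache lines ahead (guess; tune!)
    long long s = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i % 16 == 0 && i + PF_DIST < n)         // one prefetch per 64B line is plenty
            __builtin_prefetch(p + i + PF_DIST, /*rw=*/0, /*locality=*/3);
        s += p[i];
    }
    return s;
}
```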
Cache line size is still 64 bytes. (L1D read/write bandwidth is very high, and modern CPUs can do 2 vector loads per clock + 1 vector store if it all hits in L1D. See How can cache be that fast?.) With AVX512, line size = vector width, so you can load/store an entire cache line in one instruction. Thus every misaligned load/store crosses a cache-line boundary, instead of every other one as with 256-bit AVX1/AVX2, where misalignment often doesn't slow down looping over an array that isn't already hot in L1D.
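The "every misaligned load crosses a line" claim is just address arithmetic, which a standalone illustration (not benchmark code; names are mine) can show:

```cpp
#include <cstddef>

// Count how many vector loads of width `vec` bytes span a 64-byte line
// boundary when streaming over `n` bytes starting at byte offset `off`.
size_t line_splits(size_t off, size_t n, size_t vec) {
    size_t splits = 0;
    for (size_t p = off; p + vec <= off + n; p += vec)
        if (p / 64 != (p + vec - 1) / 64)   // first and last byte in different lines
            ++splits;
    return splits;
}
```

With a start address misaligned by 16 bytes, 32-byte (AVX2) loads split on every other line, while 64-byte (AVX512) loads split every single time.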
Unaligned load instructions have zero penalty if the address happens to be aligned at runtime, but compilers (especially gcc) make better code when autovectorizing if they know about any alignment guarantees. Actually-unaligned ops are generally fast, but page-splits still hurt (though much less on Skylake: only ~11 extra cycles of latency vs. ~100, but still a throughput penalty).
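One way to hand gcc/clang such an alignment guarantee is the `__builtin_assume_aligned` builtin (sketch; function names are mine, and the caller is responsible for making the promise true):

```cpp
#include <cstdlib>
#include <cstddef>

// Promise the compiler that p is 64-byte aligned, so the autovectorizer
// can use aligned loads/stores and skip the peeling prologue it would
// otherwise emit to handle a potentially-misaligned start address.
void scale(float* p, size_t n, float factor) {
    float* ap = static_cast<float*>(__builtin_assume_aligned(p, 64));
    for (size_t i = 0; i < n; ++i)
        ap[i] *= factor;
}

// The caller's side of the bargain: actually allocate aligned memory.
float* alloc_aligned_floats(size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 64, n * sizeof(float)) != 0)
        return nullptr;                 // release with free()
    return static_cast<float*>(p);
}
```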
As Ulrich predicted, every multi-socket system is NUMA these days: integrated memory controllers are standard, i.e. there is no external Northbridge. But SMP no longer means multi-socket, because multi-core CPUs are widespread. Intel CPUs from Nehalem to Skylake have used a large inclusive L3 cache as a backstop for coherency between cores. AMD CPUs are different, but I'm not as clear on the details.
Skylake-X (AVX512) no longer has an inclusive L3, but I think there's still a tag directory that lets it check what's cached anywhere on chip (and if so where) without actually broadcasting snoops to all the cores. SKX uses a mesh rather than a ring bus, with generally even worse latency than previous many-core Xeons, unfortunately.
Basically all of the advice about optimizing memory placement still applies, just the details of exactly what happens when you can't avoid cache misses or contention vary.
6.4.2 Atomic ops: the benchmark showing a CAS-retry loop as 4x worse than hardware-arbitrated `lock add` probably still reflects a maximum-contention case. But in real multi-threaded programs, synchronization is kept to a minimum (because it's expensive), so contention is low and a CAS-retry loop usually succeeds without having to retry.
C++11 `std::atomic` `fetch_add` will compile to `lock add` (or `lock xadd` if the return value is used), but an algorithm using CAS to do something that can't be done with a `lock`ed instruction is usually not a disaster. Use C++11 `std::atomic` or C11 `stdatomic` instead of gcc legacy `__sync` built-ins or the newer `__atomic` built-ins, unless you want to mix atomic and non-atomic access to the same location...
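For instance (a sketch; `atomic_max` is my name for the classic CAS-retry pattern, chosen because x86 has no single locked "max" instruction):

```cpp
#include <atomic>

std::atomic<long> counter{0};

void hit() {
    // With the return value unused, this compiles to `lock add` on x86.
    counter.fetch_add(1, std::memory_order_relaxed);
}

// No locked instruction does "max", so use a CAS-retry loop. Under low
// contention this almost always succeeds on the first attempt.
void atomic_max(std::atomic<long>& target, long v) {
    long cur = target.load(std::memory_order_relaxed);
    while (cur < v &&
           !target.compare_exchange_weak(cur, v))   // reloads cur on failure
        ;                                           // retry only if someone raced us
}
```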
8.1 DWCAS (`cmpxchg16b`): You can coax gcc into emitting it, but if you want efficient loads of just one half of the object, you need ugly `union` hacks: How can I implement ABA counter with c++11 CAS?. (Don't confuse DWCAS with DCAS of 2 separate memory locations. Lock-free atomic emulation of DCAS isn't possible with DWCAS, but transactional memory (like x86 TSX) makes it possible.)
8.2.4 transactional memory: After a couple of false starts (released, then disabled by a microcode update because of a rarely-triggered bug), Intel has working transactional memory in late-model Broadwell and all Skylake CPUs. The design is still what David Kanter described for Haswell. You can use it in lock-elision mode to speed up code that uses (and can fall back to) a regular lock (especially with a single lock for all elements of a container, so multiple threads in the same critical section often don't collide), or write code that knows about transactions directly.
Update: and now Intel has disabled lock-elision on later CPUs (including Skylake) with a microcode update. The RTM (`xbegin` / `xend`) non-transparent part of TSX can still work if the OS allows it, but TSX in general is seriously turning into Charlie Brown's football.
7.5 Hugepages: anonymous transparent hugepages work well on Linux without having to manually use hugetlbfs. Make allocations >= 2MiB with 2MiB alignment (e.g. with `posix_memalign`, or an `aligned_alloc` that doesn't enforce the stupid ISO C++17 requirement to fail when `size % alignment != 0`).
A 2MiB-aligned anonymous allocation will use hugepages by default. Some workloads (e.g. those that keep using large allocations for a while after making them) may benefit from `echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag` to get the kernel to defrag physical memory whenever needed, instead of falling back to 4k pages. (See the kernel docs.) Use `madvise(MADV_HUGEPAGE)` after making large allocations (preferably still with 2MiB alignment) to more strongly encourage the kernel to stop and defrag now. `defrag = always` is too aggressive for most workloads and will spend more time copying pages around than it saves in TLB misses. (kcompactd could maybe be more efficient.)
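Put together, a 2MiB-aligned allocation plus the madvise hint looks something like this sketch (Linux-specific; `alloc_huge` is my name, error handling is trimmed, and the `madvise` can legitimately fail if THP is disabled):

```cpp
#include <cstdlib>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Allocate `bytes` with 2MiB alignment so transparent hugepages can back
// the region with 2M pages, and hint that we want that strongly enough
// for the kernel to defrag now rather than falling back to 4k pages.
void* alloc_huge(size_t bytes) {
    const size_t kHugeAlign = 2 * 1024 * 1024;
    void* p = nullptr;
    if (posix_memalign(&p, kHugeAlign, bytes) != 0)
        return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);  // best-effort; may return EINVAL
    std::memset(p, 0, bytes);          // touching the memory faults pages in
    return p;                          // release with free()
}
```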
BTW, Intel and AMD call 2M pages "large pages", with "huge" only used for 1G pages. Linux uses "hugepage" for everything larger than the standard size.
(32-bit mode legacy (non-PAE) page tables only had 4M pages as the next-largest size, because there are only 2 levels of page table, with more compact entries. The next size up would have been 4G, but that's the whole address space, and that "level" of translation is the CR3 control register, not a page-directory entry. IDK if that's related to Linux's terminology.)
Appendix B: Oprofile: Linux `perf` has mostly superseded `oprofile`. `perf list` / `perf stat -e event1,event2 ...` has names for most of the useful ways to program HW performance counters.
    perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,\
    branches,branch-misses,instructions,uops_issued.any,\
    uops_executed.thread,idq_uops_not_delivered.core -r2 ./a.out
A few years ago, the `ocperf.py` wrapper was needed to translate event names into codes, but these days `perf` has that functionality built-in.
For some examples of using it, see Can x86's MOV really be "free"? Why can't I reproduce this at all?.