16 votes

We've got a simple memory throughput benchmark. All it does is repeatedly memcpy a large block of memory.

Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.

Any ideas why this might be happening? The code that follows is compiled as Release in VS2015 and reports the average time to complete each memcpy:

64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E

32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.

We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?

We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.

#include <cstdlib>   // malloc / free
#include <cstring>   // memcpy
#include <cstdio>    // getchar
#include <memory>
#include <Windows.h>
#include <iostream>

//Prevent the memcpy from being optimized out of the for loop
__declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
    memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}

int main()
{
    const int SIZE_OF_BLOCKS = 25000000;
    const int NUMBER_ITERATIONS = 100;
    void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
    void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
    LARGE_INTEGER Frequency;
    QueryPerformanceFrequency(&Frequency);
    while (true)
    {
        LONGLONG total = 0;
        LONGLONG max = 0;
        LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
        for (int i = 0; i < NUMBER_ITERATIONS; ++i)
        {
            QueryPerformanceCounter(&StartingTime);
            MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
            QueryPerformanceCounter(&EndingTime);
            ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
            ElapsedMicroseconds.QuadPart *= 1000000;
            ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
            total += ElapsedMicroseconds.QuadPart;
            max = max(ElapsedMicroseconds.QuadPart, max); // relies on the max() macro pulled in by Windows.h
        }
        std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
        std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
    }
    getchar();
}
Does MSVC's memcpy library function select a strategy based on CPUID or anything? e.g. AVX loop vs. rep movsb? Did you make sure that both buffers are at least 64B-aligned for all tests? Did you check perf counters to see if you're getting any TLB misses, or just L3 cache misses? (Skylake can do two TLB walks in parallel). Is your Broadwell-E a multi-socket system (NUMA)? – Peter Cordes
2.2ms to copy 23.8MiB is about 10.6GiB/s each of read and write, for mixed read+write. Intel says Skylake i5-6600 (and other SKL models using DDR4-2133) have a theoretical max memory bandwidth of 34.1 GB/s (or 31.8 GiB/s). So even if every load and store misses in L3 and has to go to main memory, that's only about 2/3rds of the theoretical max. That may be normal for a single thread, though. – Peter Cordes
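As a quick sanity check on those numbers, copying the question's 25,000,000-byte block in 2.2 ms works out as follows (a small standalone calculation using only the figures quoted above):

#include <cstdio>

int main()
{
    const double bytes   = 25000000.0;  // SIZE_OF_BLOCKS from the question
    const double seconds = 2.2e-3;      // measured Skylake time per copy
    const double GiB     = 1024.0 * 1024.0 * 1024.0;
    // memcpy reads every byte once and writes it once, so read and write
    // bandwidth are each bytes/seconds.
    std::printf("~%.1f GiB/s read and ~%.1f GiB/s write\n",
                bytes / seconds / GiB, bytes / seconds / GiB);
}

This prints roughly 10.6 GiB/s each way, matching the comment.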
On MSVC with intrinsic functions enabled, a call to memcpy will be inlined for buffer lengths that are compile-time constants. Otherwise, for 64-bit, it will generate a call to the library function, which itself calls the RtlCopyMemory API function. This is what would be happening in your case, since you've prevented the memcpy call from ever being inlined. And no, it does no fancy dispatching, just some sanity checks and rep movs. – Cody Gray♦
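For illustration only (this is not the CRT's actual source, just the same underlying instruction the comment mentions), a copy expressed directly as rep movsb via MSVC's __movsb intrinsic looks like:

#include <intrin.h>
#include <cstddef>

// rep movsb copy: the CPU's microcode handles alignment and chunking.
void RepMovsbCopy(void* dst, const void* src, size_t size)
{
    __movsb(static_cast<unsigned char*>(dst),
            static_cast<const unsigned char*>(src), size);
}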
Edited above to indicate the metrics were gathered from 64-bit builds. I've actually tested about 3 Haswell/Broadwell-E and 3 Skylake machines, and every Skylake machine destroys Haswell/Broadwell-E in this metric. My Broadwell-E system is not NUMA. The CPU config in BIOS hasn't been tweaked (verified Hardware Prefetcher and Adjacent Cache Line Prefetch are both enabled). I'll take a look at the TLB/L3 cache misses on both system classes. – aggieNick02
@PeterCordes i7-6800K, which is 6 cores/12 threads, at stock 3.4 GHz – aggieNick02
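To rule out alignment as a variable (per the first comment above), the buffers can be forced to 64-byte alignment. A minimal sketch using MSVC's _aligned_malloc, separate from the original benchmark:

#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC-specific)
#include <cstdio>
#include <cstdint>

int main()
{
    const size_t SIZE_OF_BLOCKS = 25000000;
    void* src = _aligned_malloc(SIZE_OF_BLOCKS, 64);
    void* dst = _aligned_malloc(SIZE_OF_BLOCKS, 64);
    // Both residues print as 0 if the buffers really are 64-byte aligned.
    std::printf("src %% 64 = %u, dst %% 64 = %u\n",
                (unsigned)(reinterpret_cast<uintptr_t>(src) % 64),
                (unsigned)(reinterpret_cast<uintptr_t>(dst) % 64));
    _aligned_free(dst);
    _aligned_free(src);
}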

2 Answers

14 votes

Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).

(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including @BeeOnRope's testing of multiple store streams)
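As a back-of-the-envelope illustration of that concurrency/latency limit (Little's Law), here is a small calculation; the latency figures are assumptions chosen for illustration, not measurements from the machines in the question:

#include <cstdio>

int main()
{
    const double lineBytes = 64.0;  // bytes per cache line / fill buffer
    struct Cpu { const char* name; double fillBuffers; double latencyNs; };
    const Cpu cpus[] = {
        { "Broadwell-E (assuming ~80 ns memory latency)", 10.0, 80.0 },
        { "Skylake     (assuming ~60 ns memory latency)", 12.0, 60.0 },
    };
    for (const Cpu& c : cpus)
    {
        // bandwidth ~= outstanding bytes / latency; bytes per ns == GB/s
        double gbPerSec = c.fillBuffers * lineBytes / c.latencyNs;
        std::printf("%s: ~%.1f GB/s per thread\n", c.name, gbPerSec);
    }
}

Under those assumed numbers the model gives roughly 8 GB/s vs. 12.8 GB/s of demand-miss traffic per thread; L2 prefetch and the superqueue raise the real figures, but the ordering matches what the question measures.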


Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).

SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)

(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)


A quad or dual core chip only needs a couple of threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast dual-channel DDR4 has quite a lot of bandwidth.
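As a rough sketch of what using a couple of threads for one big copy can look like (the thread count and the even split below are illustrative choices, not a tuned implementation):

#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// Split one large copy into equal chunks, one std::thread per chunk.
void ParallelCopy(void* dst, const void* src, size_t size, unsigned threads)
{
    std::vector<std::thread> workers;
    const size_t chunk = size / threads;
    for (unsigned t = 0; t < threads; ++t)
    {
        const size_t offset = t * chunk;
        const size_t bytes  = (t == threads - 1) ? size - offset : chunk;
        workers.emplace_back([=] {
            std::memcpy(static_cast<char*>(dst) + offset,
                        static_cast<const char*>(src) + offset, bytes);
        });
    }
    for (auto& w : workers) w.join();
}

int main()
{
    const size_t SIZE_OF_BLOCKS = 25000000;
    void* src = std::malloc(SIZE_OF_BLOCKS);
    void* dst = std::malloc(SIZE_OF_BLOCKS);
    ParallelCopy(dst, src, SIZE_OF_BLOCKS, 4);  // 4 threads: arbitrary choice
    std::free(dst);
    std::free(src);
}

Each thread has its own fill buffers, so a few threads together can keep enough lines in flight to approach the DRAM controllers' limit, which a single thread cannot.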

For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)
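Related to the NT-store point: a minimal streaming-copy loop with AVX intrinsics could look like the sketch below. It assumes 32-byte-aligned buffers and a size that is a multiple of 32, needs AVX enabled (/arch:AVX on MSVC), and is not a drop-in memcpy replacement.

#include <immintrin.h>
#include <cstddef>

// Copy with non-temporal (streaming) stores, avoiding the read-for-ownership
// traffic that regular stores generate for the destination lines.
void StreamCopy(void* dst, const void* src, size_t size)
{
    char*       d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (size_t i = 0; i < size; i += 32)
    {
        __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i*>(s + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(d + i), v);
    }
    _mm_sfence();  // make the NT stores globally visible before returning
}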

Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).

3 votes

I finally got VTune (evaluation) up and running. It gives a DRAM-bound score of 0.602 (between 0 and 1) on Broadwell-E and 0.324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.

This makes buying into the Broadwell-E architecture a much tougher call; you really need the extra cores for it to be worth considering.

I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.

I don't think this is really an answer for "why", so I won't mark it as such, but it's as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.