16 votes

We've got a simple memory throughput benchmark. All it does is repeatedly memcpy a large block of memory.

Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.

Any ideas why this might be happening? The code that follows is compiled as Release in VS2015 and reports the average time to complete each memcpy:

64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E

32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.

We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?

We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.

#include <cstdlib>   // malloc / free
#include <cstring>   // memcpy
#include <cstdio>    // getchar
#include <memory>
#include <Windows.h>
#include <iostream>

//Prevent the memcpy from being optimized out of the for loop
__declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
    memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}

int main()
{
    const int SIZE_OF_BLOCKS = 25000000;
    const int NUMBER_ITERATIONS = 100;
    void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
    void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
    LARGE_INTEGER Frequency;
    QueryPerformanceFrequency(&Frequency);
    while (true)
    {
        LONGLONG total = 0;
        LONGLONG max = 0;
        LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
        for (int i = 0; i < NUMBER_ITERATIONS; ++i)
        {
            QueryPerformanceCounter(&StartingTime);
            MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
            QueryPerformanceCounter(&EndingTime);
            ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
            ElapsedMicroseconds.QuadPart *= 1000000;
            ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
            total += ElapsedMicroseconds.QuadPart;
            max = max(ElapsedMicroseconds.QuadPart, max); // relies on the max() macro pulled in by Windows.h
        }
        std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
        std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
    }
    getchar();
}
Does MSVC's memcpy library function select a strategy based on CPUID or anything? e.g. AVX loop vs. rep movsb? Did you make sure that both buffers are at least 64B-aligned for all tests? Did you check perf counters to see if you're getting any TLB misses, or just L3 cache misses? (Skylake can do two TLB walks in parallel). Is your Broadwell-E a multi-socket system (NUMA)? – Peter Cordes
2.2ms to copy 23.8MiB is about 10.6GiB/s each of read and write, for mixed read+write. Intel says Skylake i5-6600 (and other SKL models using DDR4-2133) have a theoretical max memory bandwidth of 34.1 GB/s (or 31.8 GiB/s). So even if every load and store misses in L3 and has to go to main memory, that's only about 2/3rds of the theoretical max. That may be normal for a single thread, though. – Peter Cordes
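As a quick sanity check on those numbers, copying the question's 25,000,000-byte block in 2.2 ms works out as follows (a small standalone calculation using only the figures quoted above):

#include <cstdio>

int main()
{
    const double bytes   = 25000000.0;  // SIZE_OF_BLOCKS from the question
    const double seconds = 2.2e-3;      // measured Skylake time per copy
    const double GiB     = 1024.0 * 1024.0 * 1024.0;
    // memcpy reads every byte once and writes it once, so read and write
    // bandwidth are each bytes/seconds.
    std::printf("~%.1f GiB/s read and ~%.1f GiB/s write\n",
                bytes / seconds / GiB, bytes / seconds / GiB);
}

This prints roughly 10.6 GiB/s each way, matching the comment.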
On MSVC with intrinsic functions enabled, a call to memcpy will be inlined for buffer lengths that are compile-time constants. Otherwise, for 64-bit, it will generate a call to the library function, which itself calls the RtlCopyMemory API function. This is what would be happening in your case, since you've prevented the memcpy call from ever being inlined. And no, it does no fancy dispatching, just some sanity checks and rep movs. – Cody Gray♦
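For illustration only (this is not the CRT's actual source, just the same underlying instruction the comment mentions), a copy expressed directly as rep movsb via MSVC's __movsb intrinsic looks like:

#include <intrin.h>
#include <cstddef>

// rep movsb copy: the CPU's microcode handles alignment and chunking.
void RepMovsbCopy(void* dst, const void* src, size_t size)
{
    __movsb(static_cast<unsigned char*>(dst),
            static_cast<const unsigned char*>(src), size);
}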
Edited above to indicate the metrics were gathered from 64-bit builds. I've actually tested about 3 Haswell/Broadwell-E and 3 Skylake machines, and every Skylake machine destroys Haswell/Broadwell-E in this metric. My Broadwell-E system is not NUMA. The CPU config in BIOS hasn't been tweaked (verified Hardware Prefetcher and Adjacent Cache Line Prefetch are both enabled). I'll take a look at the TLB/L3 cache misses on both system classes. – aggieNick02
@PeterCordes i7-6800K, which is 6 cores/12 threads, at stock 3.4 GHz – aggieNick02
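To rule out alignment as a variable (per the first comment above), the buffers can be forced to 64-byte alignment. A minimal sketch using MSVC's _aligned_malloc, separate from the original benchmark:

#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC-specific)
#include <cstdio>
#include <cstdint>

int main()
{
    const size_t SIZE_OF_BLOCKS = 25000000;
    void* src = _aligned_malloc(SIZE_OF_BLOCKS, 64);
    void* dst = _aligned_malloc(SIZE_OF_BLOCKS, 64);
    // Both residues print as 0 if the buffers really are 64-byte aligned.
    std::printf("src %% 64 = %u, dst %% 64 = %u\n",
                (unsigned)(reinterpret_cast<uintptr_t>(src) % 64),
                (unsigned)(reinterpret_cast<uintptr_t>(dst) % 64));
    _aligned_free(dst);
    _aligned_free(src);
}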

2 Answers

14 votes

Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).

(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including @BeeOnRope's testing of multiple store streams)
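As a back-of-the-envelope illustration of that concurrency/latency limit (Little's Law), here is a small calculation; the latency figures are assumptions chosen for illustration, not measurements from the machines in the question:

#include <cstdio>

int main()
{
    const double lineBytes = 64.0;  // bytes per cache line / fill buffer
    struct Cpu { const char* name; double fillBuffers; double latencyNs; };
    const Cpu cpus[] = {
        { "Broadwell-E (assuming ~80 ns memory latency)", 10.0, 80.0 },
        { "Skylake     (assuming ~60 ns memory latency)", 12.0, 60.0 },
    };
    for (const Cpu& c : cpus)
    {
        // bandwidth ~= outstanding bytes / latency; bytes per ns == GB/s
        double gbPerSec = c.fillBuffers * lineBytes / c.latencyNs;
        std::printf("%s: ~%.1f GB/s per thread\n", c.name, gbPerSec);
    }
}

Under those assumed numbers the model gives roughly 8 GB/s vs. 12.8 GB/s of demand-miss traffic per thread; L2 prefetch and the superqueue raise the real figures, but the ordering matches what the question measures.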


Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).

SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)

(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)


A quad or dual core chip only needs a couple of threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast dual-channel DDR4 has quite a lot of bandwidth.
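As a rough sketch of what using a couple of threads for one big copy can look like (the thread count and the even split below are illustrative choices, not a tuned implementation):

#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// Split one large copy into equal chunks, one std::thread per chunk.
void ParallelCopy(void* dst, const void* src, size_t size, unsigned threads)
{
    std::vector<std::thread> workers;
    const size_t chunk = size / threads;
    for (unsigned t = 0; t < threads; ++t)
    {
        const size_t offset = t * chunk;
        const size_t bytes  = (t == threads - 1) ? size - offset : chunk;
        workers.emplace_back([=] {
            std::memcpy(static_cast<char*>(dst) + offset,
                        static_cast<const char*>(src) + offset, bytes);
        });
    }
    for (auto& w : workers) w.join();
}

int main()
{
    const size_t SIZE_OF_BLOCKS = 25000000;
    void* src = std::malloc(SIZE_OF_BLOCKS);
    void* dst = std::malloc(SIZE_OF_BLOCKS);
    ParallelCopy(dst, src, SIZE_OF_BLOCKS, 4);  // 4 threads: arbitrary choice
    std::free(dst);
    std::free(src);
}

Each thread has its own fill buffers, so a few threads together can keep enough lines in flight to approach the DRAM controllers' limit, which a single thread cannot.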

For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)
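Related to the NT-store point: a minimal streaming-copy loop with AVX intrinsics could look like the sketch below. It assumes 32-byte-aligned buffers and a size that is a multiple of 32, needs AVX enabled (/arch:AVX on MSVC), and is not a drop-in memcpy replacement.

#include <immintrin.h>
#include <cstddef>

// Copy with non-temporal (streaming) stores, avoiding the read-for-ownership
// traffic that regular stores generate for the destination lines.
void StreamCopy(void* dst, const void* src, size_t size)
{
    char*       d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (size_t i = 0; i < size; i += 32)
    {
        __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i*>(s + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(d + i), v);
    }
    _mm_sfence();  // make the NT stores globally visible before returning
}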

Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).

3 votes

I finally got VTune (evaluation) up and running. It gives a DRAM-bound score of 0.602 (between 0 and 1) on Broadwell-E and 0.324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.

This makes buying into the Broadwell-E architecture a much tougher call; you really need the extra cores for it to be worth considering.

I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.

I don't think this is really an answer for "why", so I won't mark it as such, but it's as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.