Normally fast-strings or ERMSB microcode makes rep movsb/w/d/q and rep stosb/w/d/q fast for large counts (copying in 16, 32, or maybe even 64-byte chunks), possibly with an RFO-avoiding protocol for the stores. (The other string instructions, repe/repne scas/cmps, are always slow.)
Some conditions of the inputs can interfere with that best case, notably having DF=1 (backward) instead of the normal DF=0.
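For reference, the fast case is just a forward copy with DF cleared. A minimal NASM sketch (srcbuf, dstbuf, and the 4096-byte count are placeholders, not from the question):

        cld                      ; DF=0: ascending addresses, the case fast-strings/ERMSB optimizes
        lea rsi, [rel srcbuf]    ; source
        lea rdi, [rel dstbuf]    ; destination
        mov ecx, 4096            ; byte count for rep movsb
        rep movsb                ; microcoded copy; internally moves 16/32/64-byte chunks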
rep movsd performance can depend on the alignment of src and dst, including their relative misalignment. Apparently having both pointers equal to 32*n + the same offset is not too bad, so most of the copy can be done after reaching an alignment boundary. (Absolute misalignment, but the pointers are aligned relative to each other, i.e. dst-src is a multiple of 32 or 64 bytes.)
Performance does not depend on src > dst or src < dst per se. But if the pointers are within 16 or 32 bytes of overlapping, that can also force a fall-back to 1 element at a time.
Intel's optimization manual has a section about memcpy implementations and comparing rep movs with well-optimized SIMD loops. Startup overhead is one of the biggest downsides for rep movs, but so are misalignments that it doesn't handle well. (IceLake's "fast short rep" feature presumably addresses that.)
I did not disclose the CopyMemory body - and it indeed used copying backwards (df=1) when avoiding overlaps.
Yup, there's your problem: only copy backwards if there would be actual overlap you need to avoid, not just based on which address is higher. And then do it with SIMD vectors, not rep movsd.
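A hedged sketch of that overlap test in NASM, assuming src in rsi, dst in rdi, and the byte count in rdx; copy_forward and copy_backward_simd are hypothetical helper labels, not anything from the question:

        ; rsi = src, rdi = dst, rdx = byte count
        lea rax, [rsi + rdx]       ; rax = src + len, one past the end of the source
        cmp rdi, rsi
        jbe copy_forward           ; dst <= src: a forward copy never clobbers unread source bytes
        cmp rdi, rax
        jae copy_forward           ; dst >= src+len: no overlap at all
        jmp copy_backward_simd     ; src < dst < src+len: real overlap, copy high-to-low with SIMD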
rep movsd is only fast with DF=0 (ascending addresses), at least on Intel CPUs. I just checked on Skylake: 1000000 reps of copying 4096 non-overlapping bytes from page-aligned buffers with rep movsb runs in:
- 174M cycles with cld (DF=0 forwards): about 42ms at about 4.1GHz, or about 90GiB/s L1d read+write bandwidth achieved. About 23 bytes per cycle, so the startup overhead of each rep movsb seems to be hurting us. An AVX copy loop should achieve close to 32 bytes per cycle in this easy case of pure L1d cache hits, even with a branch mispredict on loop exit from an inner loop.
- 4161M cycles with std (DF=1 backwards): about 1010ms at about 4.1GHz, or about 3.77GiB/s read+write. About 0.98 bytes per cycle, consistent with rep movsb being totally un-optimized for DF=1. (1 count per cycle, so rep movsd would be about 4x that bandwidth with cache hits.)
The uops_executed perf counter also confirms that it's spending many more uops when copying backwards. (This was inside a dec ebp / jnz loop in long mode under Linux, the same test loop as in Can x86's MOV really be "free"? Why can't I reproduce this at all?, built with NASM, with the buffers in the BSS. The loop body did cld or std / 2x lea / mov ecx, 4096 / rep movsb. Hoisting cld out of the loop didn't make much difference.)
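An approximate reconstruction of that test loop (NASM, x86-64 Linux); the buffer and label names are my guesses, but the structure matches the description above:

    section .bss align=4096          ; page-aligned buffers in the BSS
    srcbuf: resb 4096
    dstbuf: resb 4096

    section .text
    global _start
    _start:
        mov ebp, 1000000             ; outer repeat count
    copy_loop:
        cld                          ; or std for the DF=1 test (with rsi/rdi pointing at the last bytes instead)
        lea rsi, [rel srcbuf]        ; rep movs advances rsi/rdi, so reset them every iteration
        lea rdi, [rel dstbuf]
        mov ecx, 4096
        rep movsb
        dec ebp
        jnz copy_loop
        mov eax, 60                  ; exit(0) via Linux sys_exit
        xor edi, edi
        syscall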
You were using rep movsd, which copies 4 bytes at a time, so for backwards copying we can expect 4 bytes per cycle if they hit in cache. And you were probably using large buffers, so cache misses bottleneck the forward direction to not much faster than backwards. But the extra uops from backward copying would still hurt memory parallelism: fewer cache lines are touched by the load uops that fit in the out-of-order window. Also, some prefetchers work less well going backwards on Intel CPUs. The L2 streamer works in either direction, but I think L1d prefetch only goes forward.
Related: Enhanced REP MOVSB for memcpy. Your Sandybridge is too old for ERMSB, but Fast Strings for rep movs/rep stos has existed since the original P6. Your Clovertown Xeon from ~2006 (Conroe/Merom microarchitecture) is pretty much ancient by today's standards. Those CPUs might be so old that a single core of a Xeon can saturate the meagre memory bandwidth, unlike today's many-core Xeons.
My buffers were page-aligned. For downward copies, I tried having the initial RSI/RDI point to the last byte of a page, so the initial pointers were not aligned but the total region to be copied was. I also tried lea rdi, [buf+4096] so the starting pointers were page-aligned (so [buf+0] didn't get written). Neither made the backwards copy any faster; rep movs is just garbage with DF=1. Use SIMD vectors if you need to copy backwards.
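If you really do need a descending copy (e.g. for an overlapping memmove), here is a hedged AVX sketch going high-to-low. It assumes src in rsi, dst in rdi, and a nonzero count in rdx that's a multiple of 32, and it leaves out the alignment and tail handling a real implementation needs:

        ; rsi = src, rdi = dst, rdx = byte count (nonzero multiple of 32)
    backward_loop:
        sub rdx, 32
        vmovdqu ymm0, [rsi + rdx]    ; load the highest 32 not-yet-copied bytes
        vmovdqu [rdi + rdx], ymm0    ; going high-to-low, any source bytes this clobbers were already copied
        jnz backward_loop            ; flags are still from the sub: loop until rdx reaches 0
        vzeroupper                   ; avoid AVX/SSE transition penalties in surrounding code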
Usually a SIMD vector loop can be at least as fast as rep movs, if you can use vectors as wide as the machine supports. That means having SSE, AVX, and AVX512 versions... In portable code without runtime dispatching to a memcpy implementation tuned for the specific CPU, rep movsd is often pretty good, and should be even better on future CPUs like IceLake.
You don't actually need page alignment for rep movs to be fast. IIRC, 32-byte aligned source and destination is sufficient. But 4k aliasing could also be a problem: if dst & 4095 is slightly higher than src & 4095, the load uops might internally have to wait some extra cycles for the store uops, because the fast-path mechanism for detecting when a load is reloading a recent store only looks at page-offset bits. (For example, if dst & 4095 is 64 higher than src & 4095, each load has the same page offset as the store from 64 bytes earlier in the copy, so it falsely looks like it might be reloading that very recent store.) Page alignment is one way to make sure you get the optimal case for rep movs, though.
Normally you get best performance from a SIMD loop, but only if you use SIMD vectors as wide as the machine supports (like AVX, or maybe even AVX512). And you should choose NT stores vs. normal stores depending on the hardware and the surrounding code.
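As a rough illustration of the NT-store option for copies much larger than the cache, a hedged AVX sketch; it assumes rsi = src, rdi = 32-byte-aligned dst, rdx = a nonzero byte count that's a multiple of 32, and non-overlapping buffers:

        ; rsi = src, rdi = dst (32-byte aligned), rdx = byte count (nonzero multiple of 32)
        xor eax, eax                 ; zero the offset (also zeroes rax)
    nt_copy_loop:
        vmovdqu ymm0, [rsi + rax]    ; unaligned load is fine
        vmovntdq [rdi + rax], ymm0   ; NT store: needs an aligned destination, bypasses cache and avoids RFOs
        add rax, 32
        cmp rax, rdx
        jb nt_copy_loop
        sfence                       ; NT stores are weakly ordered; fence before any later release store
        vzeroupper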