Normally fast-strings or ERMSB microcode makes rep movsb/w/d/q and rep stosb/w/d/q fast for large counts (copying in 16, 32, or maybe even 64-byte chunks), possibly with an RFO-avoiding protocol for the stores. (The other string instructions, repe/repne scas/cmps, are always slow.)
Some conditions of the inputs can interfere with that best case, notably having DF=1 (backward) instead of the normal DF=0.
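For reference, the fast case is just a forward copy with DF cleared. A minimal NASM sketch (srcbuf, dstbuf, and the 4096-byte count are placeholders, not from the question):

        cld                      ; DF=0: ascending addresses, the case fast-strings/ERMSB optimizes
        lea rsi, [rel srcbuf]    ; source
        lea rdi, [rel dstbuf]    ; destination
        mov ecx, 4096            ; byte count for rep movsb
        rep movsb                ; microcoded copy; internally moves 16/32/64-byte chunks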
rep movsd performance can depend on the alignment of src and dst, including their relative misalignment. Apparently having both pointers equal to 32*n + the same offset is not too bad, so most of the copy can be done after reaching an alignment boundary. (Absolute misalignment, but the pointers are aligned relative to each other, i.e. dst-src is a multiple of 32 or 64 bytes.)
Performance does not depend on src > dst or src < dst per se. But if the pointers are within 16 or 32 bytes of overlapping, that can also force a fall-back to 1 element at a time.
Intel's optimization manual has a section about memcpy implementations and comparing rep movs with well-optimized SIMD loops. Startup overhead is one of the biggest downsides for rep movs, but so are misalignments that it doesn't handle well. (IceLake's "fast short rep" feature presumably addresses that.)
I did not disclose the CopyMemory body - and it indeed used copying backwards (df=1) when avoiding overlaps.
Yup, there's your problem: only copy backwards if there would be actual overlap you need to avoid, not just based on which address is higher. And then do it with SIMD vectors, not rep movsd.
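A hedged sketch of that overlap test in NASM, assuming src in rsi, dst in rdi, and the byte count in rdx; copy_forward and copy_backward_simd are hypothetical helper labels, not anything from the question:

        ; rsi = src, rdi = dst, rdx = byte count
        lea rax, [rsi + rdx]       ; rax = src + len, one past the end of the source
        cmp rdi, rsi
        jbe copy_forward           ; dst <= src: a forward copy never clobbers unread source bytes
        cmp rdi, rax
        jae copy_forward           ; dst >= src+len: no overlap at all
        jmp copy_backward_simd     ; src < dst < src+len: real overlap, copy high-to-low with SIMD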
rep movsd is only fast with DF=0 (ascending addresses), at least on Intel CPUs. I just checked on Skylake: 1000000 reps of copying 4096 non-overlapping bytes from page-aligned buffers with rep movsb runs in:
- 174M cycles with cld (DF=0 forwards): about 42ms at about 4.1GHz, or about 90GiB/s L1d read+write bandwidth achieved. About 23 bytes per cycle, so the startup overhead of each rep movsb seems to be hurting us. An AVX copy loop should achieve close to 32 bytes per cycle in this easy case of pure L1d cache hits, even with a branch mispredict on loop exit from an inner loop.
- 4161M cycles with std (DF=1 backwards): about 1010ms at about 4.1GHz, or about 3.77GiB/s read+write. About 0.98 bytes per cycle, consistent with rep movsb being totally un-optimized for DF=1. (1 count per cycle, so rep movsd would be about 4x that bandwidth with cache hits.)
The uops_executed perf counter also confirms that it's spending many more uops when copying backwards. (This was inside a dec ebp / jnz loop in long mode under Linux, the same test loop as in Can x86's MOV really be "free"? Why can't I reproduce this at all?, built with NASM, with the buffers in the BSS. The loop body did cld or std / 2x lea / mov ecx, 4096 / rep movsb. Hoisting cld out of the loop didn't make much difference.)
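An approximate reconstruction of that test loop (NASM, x86-64 Linux); the buffer and label names are my guesses, but the structure matches the description above:

    section .bss align=4096          ; page-aligned buffers in the BSS
    srcbuf: resb 4096
    dstbuf: resb 4096

    section .text
    global _start
    _start:
        mov ebp, 1000000             ; outer repeat count
    copy_loop:
        cld                          ; or std for the DF=1 test (with rsi/rdi pointing at the last bytes instead)
        lea rsi, [rel srcbuf]        ; rep movs advances rsi/rdi, so reset them every iteration
        lea rdi, [rel dstbuf]
        mov ecx, 4096
        rep movsb
        dec ebp
        jnz copy_loop
        mov eax, 60                  ; exit(0) via Linux sys_exit
        xor edi, edi
        syscall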
You were using rep movsd, which copies 4 bytes at a time, so for backwards copying we can expect 4 bytes per cycle if they hit in cache. And you were probably using large buffers, so cache misses bottleneck the forward direction to not much faster than backwards. But the extra uops from backward copying would still hurt memory parallelism: fewer cache lines are touched by the load uops that fit in the out-of-order window. Also, some prefetchers work less well going backwards on Intel CPUs. The L2 streamer works in either direction, but I think L1d prefetch only goes forward.
Related: Enhanced REP MOVSB for memcpy. Your Sandybridge is too old for ERMSB, but Fast Strings for rep movs/rep stos has existed since the original P6. Your Clovertown Xeon from ~2006 (Conroe/Merom microarchitecture) is pretty much ancient by today's standards. Those CPUs might be so old that a single core of a Xeon can saturate the meagre memory bandwidth, unlike today's many-core Xeons.
My buffers were page-aligned. For downward copies, I tried having the initial RSI/RDI point to the last byte of a page, so the initial pointers were not aligned but the total region to be copied was. I also tried lea rdi, [buf+4096] so the starting pointers were page-aligned (so [buf+0] didn't get written). Neither made the backwards copy any faster; rep movs is just garbage with DF=1. Use SIMD vectors if you need to copy backwards.
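If you really do need a descending copy (e.g. for an overlapping memmove), here is a hedged AVX sketch going high-to-low. It assumes src in rsi, dst in rdi, and a nonzero count in rdx that's a multiple of 32, and it leaves out the alignment and tail handling a real implementation needs:

        ; rsi = src, rdi = dst, rdx = byte count (nonzero multiple of 32)
    backward_loop:
        sub rdx, 32
        vmovdqu ymm0, [rsi + rdx]    ; load the highest 32 not-yet-copied bytes
        vmovdqu [rdi + rdx], ymm0    ; going high-to-low, any source bytes this clobbers were already copied
        jnz backward_loop            ; flags are still from the sub: loop until rdx reaches 0
        vzeroupper                   ; avoid AVX/SSE transition penalties in surrounding code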
Usually a SIMD vector loop can be at least as fast as rep movs, if you can use vectors as wide as the machine supports. That means having SSE, AVX, and AVX512 versions... In portable code without runtime dispatching to a memcpy implementation tuned for the specific CPU, rep movsd is often pretty good, and should be even better on future CPUs like IceLake.
You don't actually need page alignment for rep movs to be fast. IIRC, 32-byte aligned source and destination is sufficient. But 4k aliasing could also be a problem: if dst & 4095 is slightly higher than src & 4095, the load uops might internally have to wait some extra cycles for the store uops, because the fast-path mechanism for detecting when a load is reloading a recent store only looks at page-offset bits. (For example, if dst & 4095 is 64 higher than src & 4095, each load has the same page offset as the store from 64 bytes earlier in the copy, so it falsely looks like it might be reloading that very recent store.) Page alignment is one way to make sure you get the optimal case for rep movs, though.
Normally you get best performance from a SIMD loop, but only if you use SIMD vectors as wide as the machine supports (like AVX, or maybe even AVX512). And you should choose NT stores vs. normal stores depending on the hardware and the surrounding code.
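As a rough illustration of the NT-store option for copies much larger than the cache, a hedged AVX sketch; it assumes rsi = src, rdi = 32-byte-aligned dst, rdx = a nonzero byte count that's a multiple of 32, and non-overlapping buffers:

        ; rsi = src, rdi = dst (32-byte aligned), rdx = byte count (nonzero multiple of 32)
        xor eax, eax                 ; zero the offset (also zeroes rax)
    nt_copy_loop:
        vmovdqu ymm0, [rsi + rax]    ; unaligned load is fine
        vmovntdq [rdi + rax], ymm0   ; NT store: needs an aligned destination, bypasses cache and avoids RFOs
        add rax, 32
        cmp rax, rdx
        jb nt_copy_loop
        sfence                       ; NT stores are weakly ordered; fence before any later release store
        vzeroupper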