Imagine a load-store loop like the following, which loads DWORDs from non-contiguous locations and stores them contiguously:
top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rdi and rsi somehow
cmp ...
jne top
On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on stores at one store per cycle. That's kind of wasteful, since it's an IPC of only 2 (one store, one load).
One idea that naturally arises is to combine two DWORD loads into a single QWORD store, which is possible since the stores are contiguous. Something like this could work:
top:
mov eax, DWORD [rsi]
mov ebx, DWORD [rdx]
shl rbx, 32
or rax, rbx
mov QWORD [rdi], rax
Basically, do the two loads and use two ALU ops to combine them into a single QWORD, which we can store with a single store. Now we're bottlenecked on uops: 5 uops per 2 DWORDs, so at 4 uops per cycle that is 1.25 cycles per QWORD or 0.625 cycles per DWORD.
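To make the data movement concrete, here is a minimal C sketch of what the combined sequence computes. It is a model of the semantics only, not of the uop counts, and it assumes a little-endian target (as on x86-64). The helper names `store_pair_combined` and `store_pair_separate` are hypothetical, introduced just for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper modelling:
 *   mov eax, DWORD [rsi] / mov ebx, DWORD [rdx]
 *   shl rbx, 32 / or rax, rbx / mov QWORD [rdi], rax
 * Assumes a little-endian target. */
static void store_pair_combined(void *dst, const uint32_t *a, const uint32_t *b)
{
    assert(dst && a && b);
    uint64_t lo = *a;               /* 32-bit load, zero-extended to 64 bits */
    uint64_t hi = *b;               /* second 32-bit load, also zero-extended */
    uint64_t q  = lo | (hi << 32);  /* the two ALU ops: shift, then OR */
    memcpy(dst, &q, sizeof q);      /* one 8-byte store */
}

/* Reference version: two separate 4-byte stores, as in the first loop. */
static void store_pair_separate(void *dst, const uint32_t *a, const uint32_t *b)
{
    assert(dst && a && b);
    memcpy(dst, a, 4);
    memcpy((char *)dst + 4, b, 4);
}
```

On a little-endian machine both helpers leave the same 8 bytes at `dst`. The merge relies on the 32-bit loads zero-extending into the full 64-bit register, so the OR needs no extra masking.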
Already much better than the first option, but I can't help but think there is a better option for this shuffling. For example, we are wasting uop throughput by using plain loads: it feels like we should be able to combine at least some of the ALU ops with the loads by using memory source operands, but I was mostly stymied on Intel: shl on memory only has a RMW form, and shlx and rorx don't micro-fuse.
It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the low DWORD of the loaded value.
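The offset-load idea can be sketched in C as follows. Again this only models the semantics: it assumes a little-endian target and that the 4 bytes immediately below the second source are readable. The name `combine_offset_load` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper modelling a QWORD load from [rdx - 4]:
 * the wanted DWORD lands in the high half without a shift, but the
 * low half holds whatever bytes precede it and must be cleared.
 * Assumes little-endian and that b[-1] is readable memory. */
static uint64_t combine_offset_load(const uint32_t *a, const uint32_t *b)
{
    assert(a && b);
    uint64_t lo = *a;                             /* zero-extended DWORD load */
    uint64_t hi;
    memcpy(&hi, (const char *)b - 4, sizeof hi);  /* 8-byte load starting 4 bytes below b */
    hi &= 0xFFFFFFFF00000000ull;                  /* clear the garbage low DWORD */
    return lo | hi;
}
```

The shift is gone, but the mask takes its place, so the ALU op count is unchanged unless the masking can be folded into another instruction.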
I'm interested in scalar code, both for the base x86-64 instruction set and, if possible, better versions that use helpful extensions like BMI.
punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake), so movd-load / punpckldq-load / movq-store is usable if loading a qword from one of the dword locations is ok for correctness and performance. – Peter Cordes

"increment somehow" isn't meant to imply that you're interleaving two contiguous arrays, is it? Obviously SSE2 punpckldq / hdq does that well. – Peter Cordes