Packing two DWORDs into a QWORD to save store bandwidth

Question

Imagine a load-store loop like the following which loads DWORDs from non-contiguous locations and stores them contiguously:

top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rdi and rsi somehow
cmp ...
jne top

On modern Intel and AMD hardware, when running in-cache such a loop will usually bottleneck ones stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one load).

One idea that naturally arises is to combine two DWORD loads into a single QWORD store which is possible since the stores are contiguous. Something like this could work:

top:
mov eax, DWORD [rsi]
mov ebx, DWORD [rdx]
shl rbx, 32
or  rax, rbx
mov QWORD [rdi]

Basically do the two loads and use two ALU ops to combine them into a single QWORD which we can store with a single store. Now we're bottlenecked on uops: 5 uops per 2 DWORDs - so 1.25 cycles per QWORD or 0.625 cycles per DWORD.

Already much better than the first option, but I can't help but think there is a better option for this shuffling - for example, we are wasting uop throughput by using plain loads - It feels like we should be able to combine at least some of the ALU ops with the loads with memory source operands, but I was mostly stymied on Intel: shl on memory only has a RMW form, and shlx and rolx don't micro-fuse.

It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the load DWORD.

I'm interested in scalar code, and code for both the base x86-64 instruction set and better versions if possible with useful extensions like BMI.

MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake), so movd-load / punpckldq-load / movq-store is usable if loading a qword from one of the dword locations is ok for correctness and performance. — Peter Cordes
increment somehow isn't meant to imply that you're interleaving two contiguous arrays, is it? Obviously SSE2 punpckldq / hdq does that well. — Peter Cordes
@PeterCordes - well the point was to hide the details of the two arrays and avoid people using the fact that they are contiguous. In my application they aren't (it's a bit more complex than the above, but I was trying to keep the use-case simple). You could imagine that the unrolled versions have things like [rsi + 0x1234] as their source if you want (i.e., big jumps). I'm mostly interested in scalar code here. — BeeOnRope
Ok, that's what I thought. No opportunity to load data for multiple iterations with a vector and shuffle. — Peter Cordes
@JonathonReinhart that doesn't help, only one store can be executed per cycle so if they are not merged manually, it is already too late. — harold

Peter Cordes Peter Cordes · Accepted Answer · 2017-11-11T20:35:38

It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the load DWORD.

If wider loads are ok for correctness and performance (cache-line splits...), we can use shld

top:
    mov eax, DWORD [rsi]
    mov rbx, QWORD [rdx-4]     ; unaligned(?) 64-bit load

    shld rax, rbx, 32          ; 1 uop on Intel SnB-family, 0.5c recip throughput
    mov QWORD [rdi], rax

MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).

top:
    movd       mm0, DWORD [rsi]
    punpckldq  mm0, QWORD [rdx]     ; 1 micro-fused uop on Intel SnB-family

    movq       QWORD [rdi], mm0

 ; required after the loop, making it only worth-while for long-running loops
 emms

punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.

You could also use the 128-bit AVX version, but that's even more likely to cross a cache line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/

Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.

To avoid CL-splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason for MMX.

top:
    movd       xmm0, DWORD [rsi]

    movd       xmm1, DWORD [rdx]           ; SSE2
    punpckldq  xmm0, xmm1
    ; or pinsrd  xmm0, DWORD [rdx], 1      ; 2 uops not micro-fused

    movq       QWORD [rdi], xmm0

Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.

Packing two DWORDs into a QWORD to save store bandwidth

1 Answers