3 votes

I've got code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix, as part of a larger routine that rotates a B/W image with loop tiling.

I profiled it with IACA to see if it was worth doing an AVX2 routine for, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles, HSW/SKL: 36 cycles). (IVB and HSW measured with IACA 2.1, SKL with 3.0, but HSW gives the same number with 3.0.)

From the IACA output I guess the difference is that IVB uses ports 1 and 5 for the above instructions, while Haswell only uses port 5.

I googled a bit, but couldn't find an explanation. Is Haswell really slower with legacy SSE, or did I just hit some extreme corner case? Any suggestions to dodge this bullet (other than AVX2, which is a known option but postponed for now because it requires updating the toolchain to a new version)?

General remarks or suggested improvements are also welcome.

   // r8 and r9 are the number of bytes to the next line in src and dest respectively
   // r12 = 3*r8, r13 = 3*r9
  // load 8x8 bytes into 4 registers, bytes interleaved.
  movq xmm1,[rcx]
  movq xmm4,[rcx+2*r8]
  PUNPCKLBW xmm1,xmm4   // 0 2 0 2 0 2
  movq xmm7,[rcx+r8]
  movq xmm6,[rcx+r12]
  PUNPCKLBW xmm7,xmm6   // 1 3 1 3 1 3

  movdqa xmm2,xmm1
  punpcklbw xmm1,xmm7   // 0 1 2 3 0 1 2 3 in xmm1:xmm2
  punpckhbw xmm2,xmm7   
  lea rcx,[rcx+4*r8]

  // same for 4..7

  movq xmm3,[rcx]
  movq xmm5,[rcx+2*r8]
  PUNPCKLBW xmm3,xmm5
  movq xmm7,[rcx+r8]
  movq xmm8,[rcx+r12]
  PUNPCKLBW xmm7,xmm8

  movdqa xmm4,xmm3
  punpcklbw xmm3,xmm7
  punpckhbw xmm4,xmm7

  // now we join one "low" dword from XMM1:xmm2 with one "high" dword
  // from XMM3:xmm4

  movdqa  xmm5,xmm1
  pextrd  eax,xmm3,0
  pinsrd  xmm5,eax,1
  movq    [rdx],xmm5

  movdqa  xmm5,xmm3
  pextrd  eax,xmm1,1
  pinsrd  xmm5,eax,0
  movq    [rdx+r9],xmm5

  movdqa  xmm5,xmm1
  pextrd  eax,xmm3,2
  pinsrd  xmm5,eax,3
  MOVHLPS  xmm6,xmm5
  movq    [rdx+2*r9],xmm6

  movdqa  xmm5,xmm3
  pextrd  eax,xmm1,3
  pinsrd  xmm5,eax,2
  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

  lea     rdx,[rdx+4*r9]

  movdqa  xmm5,xmm2
  pextrd  eax,xmm4,0
  pinsrd  xmm5,eax,1
  movq    [rdx],xmm5

  movdqa  xmm5,xmm4
  pextrd  eax,xmm2,1
  pinsrd  xmm5,eax,0
  movq    [rdx+r9],xmm5

  movdqa  xmm5,xmm2
  pextrd  eax,xmm4,2
  pinsrd  xmm5,eax,3
  MOVHLPS  xmm6,xmm5
  movq    [rdx+2*r9],xmm6

  movdqa  xmm5,xmm4
  pextrd  eax,xmm2,3
  pinsrd  xmm5,eax,2
  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

  lea     rdx,[rdx+4*r9]

Purpose: it really is rotating images from a camera for computer-vision purposes. In some (heavier) apps the rotation is postponed and done display-only (OpenGL); in others it is easier to rotate the input than to adapt the algorithms.

Updated code: I posted some final code here. The speedup was very dependent on the size of the input: large on small images, but still a factor of two on larger ones compared to loop-tiling HLL code with a 32x32 tile (same algorithm as the linked asm code).

It's not really about legacy SSE; as the IACA output shows, it's because Haswell removed a shuffle unit, so these instructions can only go to p5. – harold
Ok, searching for "shuffle unit" yields some results, like realworldtech.com/haswell-cpu/4. Seems it can't be helped without going 256-bit, in other words AVX2. That said, Aki shaved off some cycles. – Marco van de Voort
Related, but not a true duplicate. Just count the number of loads. Source and destination format is not a degree of freedom in image manipulation, so e.g. gather operations are not usable (at all). Also it uses VEX code. I will still keep it on file for the AVX2 version though, so thanks anyway. – Marco van de Voort
See also Packing two DWORDs into a QWORD to save store bandwidth. But you're doing some punpcklbw, so you can't avoid SSE2 entirely and just use integer shld or something. But note that Haswell has 2 load ports but only 1 shuffle port. – Peter Cordes

2 Answers

3 votes

TL:DR: use punpckl/hdq to save a massive number of shuffles in the dword-rearranging step, exactly like the transpose code in A better 8x8 bytes matrix transpose with SSE?

Your memory layout requires storing the low / high 8 bytes of each vector result separately, which you can do efficiently with movq [rdx], xmm / movhps [rdx+r9], xmm.


The code is almost twice as slow on Haswell/Skylake as on IVB

Your code bottlenecks heavily on shuffle throughput.

Haswell only has one shuffle execution unit, on port 5. SnB/IvB have 2 integer shuffle units (but still only one FP shuffle unit). See Agner Fog's instruction tables and optimization guide / microarch guide.

I see you already found David Kanter's excellent Haswell microarch write-up.

It's very easy to bottleneck on shuffle (or port5 in general) throughput for code like this, and it often gets worse with AVX / AVX2 because many shuffles are in-lane only. AVX for 128-bit ops might help, but I don't think you'll gain anything from shuffling into 256b vectors and then shuffling them apart again into 64-bit chunks. If you could load or store contiguous 256b chunks, it would be worth trying.


You have some simple missed-optimizations, even before we think about major changes:

  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

should be movhps [rdx+r13],xmm5, storing the high 8 bytes of xmm5 directly. On Sandybridge and Haswell, movhps is a pure store uop, with no shuffle uop required.

pextrd eax,xmm3,0 is always worse than movd eax,xmm3; never use pextrd with an immediate 0. (Also, pextrd directly to memory might be a win: you could do a 64-bit movq store and then overwrite half of it with a 32-bit pextrd. You might then bottleneck on store throughput. Also, on Sandybridge, indexed store addressing modes don't stay micro-fused, so more stores would hurt your total uop throughput. Haswell doesn't have that problem for stores, only for some indexed loads depending on the instruction.) If you use more stores in some places and more shuffles in others, put the extra stores where you have single-register addressing modes.
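For illustration, a sketch of that store-heavy variant for the first two output rows (using the dword naming from the example further down, xmm1 = [ a b c d ] and xmm3 = [ i j k l ]; the exact register and offset choices here are mine, not from the original code):

   movq    [rdx], xmm1          ; store dwords a,b
   pextrd  [rdx+4], xmm3, 0     ; overwrite b with i   ->  row 0 = [ a i ]
   movq    [rdx+r9], xmm3       ; store dwords i,j
   pextrd  [rdx+r9], xmm1, 1    ; overwrite i with b   ->  row 1 = [ b j ]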


Source and destination format is not a degree of freedom in image manipulation.

Depends what you're doing. x264 (the open-source h.264 video encoder) copies 8x8 blocks into contiguous buffers before working with them repeatedly, so the stride between rows is an assemble-time constant.

This saves passing a stride in a register and doing stuff like you're doing with [rcx+2*r8] / [rcx+r8]. It also lets you load two rows with one movdqa. And it gives you good memory locality for accessing 8x8 blocks.
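A sketch of what the loads could look like with such a copied block (my assumption: a 16-byte-aligned 64-byte buffer, so the 8-byte row stride is an assemble-time constant and two rows fit in each load):

  movdqa xmm1, [rcx]       ; rows 0 and 1
  movdqa xmm2, [rcx+16]    ; rows 2 and 3
  movdqa xmm3, [rcx+32]    ; rows 4 and 5
  movdqa xmm4, [rcx+48]    ; rows 6 and 7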

Of course it's probably not a win to spend time copying in/out of this format if this rotation is all you're doing with an 8x8 block of pixels. FFmpeg's h.264 decoder (which uses many of the same asm primitives as x264) doesn't use this, but IDK if that's because nobody ever bothered to port the updated x264 asm or if it's just not worth it.


  // now we join one "low" dword from XMM1:xmm2 with one "high" dword
  // from XMM3:xmm4

extract/insert to/from integer is not very efficient; pinsrd and pextrd are 2 uops each, and one of those uops is a shuffle. You might even come out ahead of your current code just using pextrd to memory in 32-bit chunks.

Also consider using SSSE3 pshufb which can put your data anywhere it needs to be, and zero other elements. This can set you up for merging with por. (You might use pshufb instead of punpcklbw).
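A rough sketch of the pshufb + por merging idea; the 16-byte-aligned control vectors shuf_a and shuf_b are hypothetical constants you would have to define (each control byte selects a source byte, and a control byte with its high bit set zeroes that lane):

  movdqa  xmm5, xmm1
  pshufb  xmm5, [shuf_a]    ; place the wanted bytes of xmm1, zero the rest
  movdqa  xmm6, xmm3
  pshufb  xmm6, [shuf_b]    ; place the wanted bytes of xmm3 in the complementary lanes
  por     xmm5, xmm6        ; merged result, ready to store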


Another option is to use shufps to combine data from two sources. You might need another shuffle after that. Or use punpckldq.
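For example, a shufps version of the [ a i b j ] row pair might look like this (dword naming as introduced just below; note it still needs the follow-up shuffle, which is why the punpckldq version below is preferable):

   movaps  xmm5, xmm1
   shufps  xmm5, xmm3, 0x44   ; xmm5 = [ a b i j ]  (two low dwords from each source)
   pshufd  xmm5, xmm5, 0xD8   ; xmm5 = [ a i b j ]  (the extra shuffle mentioned above)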

// "low" dwords from XMM1:xmm2
//  high dwords from XMM3:xmm4

;  xmm1:  [ a b c d ]   xmm2: [ e f g h ]
;  xmm3:  [ i j k l ]   xmm4: [ m n o p ]

; want: [ a i b j ] / [ c k d l ] / ... I think.

;; original: replace these with
;  movdqa  xmm5,xmm1     ; xmm5 = [ a b c d ]
;  pextrd  eax,xmm3,0    ; eax = i
;  pinsrd  xmm5,eax,1    ; xmm5 = [ a i ... ]
;  movq    [rdx],xmm5

;  movdqa  xmm5,xmm3       ; xmm5 = [ i j k l ]
;  pextrd  eax,xmm1,1      ; eax = b
;  pinsrd  xmm5,eax,0      ; xmm5 = [ b j ... ]
;  movq    [rdx+r9],xmm5

Replace with this:

   movdqa    xmm5, xmm1
   punpckldq xmm5, xmm3     ; xmm5 = [ a i b j ]
   movq     [rdx], xmm5
   movhps   [rdx+r9], xmm5  ; still a pure store, doesn't cost a shuffle

So we've replaced 4 shuffle uops with 1, and lowered the total uop count from 12 fused-domain uops (Haswell) to 4. (Or on Sandybridge, from 13 to 5 because the indexed store doesn't stay micro-fused).

Use punpckhdq for [ c k d l ], where it's even better because we're replacing movhlps as well.

 ;  movdqa  xmm5,xmm1       ; xmm5 = [ a b c d ]
 ; pextrd  eax,xmm3,2      ; eax = k
 ; pinsrd  xmm5,eax,3      ; xmm5 = [ a b c k ]
 ; MOVHLPS  xmm6,xmm5      ; xmm6 = [ c k ? ? ]  (false dependency on old xmm6)
 ; movq   [rdx+2*r9],xmm6
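The punpckhdq replacement would look something like this (a sketch following the same pattern; it also covers the [ d l ] row that the original code handled with a second pextrd/pinsrd/movhlps block):

   movdqa    xmm5, xmm1
   punpckhdq xmm5, xmm3        ; xmm5 = [ c k d l ]
   movq      [rdx+2*r9], xmm5  ; row 2 = [ c k ]
   movhps    [rdx+r13], xmm5   ; row 3 = [ d l ], again a pure store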

Then unpack lo/hi for xmm2 and xmm4.

Using AVX or AVX2 would let you skip the movdqa, because you can unpack into a new destination register instead of copy + destroy.
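For instance, the VEX-encoded (AVX) form of the first pair of rows could be written without the register copy (a sketch; otherwise identical to the SSE version above):

   vpunpckldq xmm5, xmm1, xmm3   ; xmm5 = [ a i b j ], xmm1 left intact
   vmovq      [rdx], xmm5
   vmovhps    [rdx+r9], xmm5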

3 votes

Inserting a dword is more (if not most) efficiently done with a combination of pshufd and an immediate blend.

 pshufd  xmm5, xmm3, 0x55 * slot          ; broadcast source dword 'slot' (0..3) to all positions
 pblendw xmm1, xmm5, 3 << (2 * dst_slot)  ; overwrite destination dword 'dst_slot' (two words)
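As a concrete instance (my register choices; this mirrors the question's movdqa xmm5,xmm1 / pextrd eax,xmm3,0 / pinsrd xmm5,eax,1 sequence, i.e. slot = 0 and dst_slot = 1, blending into a copy so xmm1 stays intact for the later rows):

 movdqa  xmm5, xmm1           ; copy, xmm1 is still needed afterwards
 pshufd  xmm6, xmm3, 0x00     ; broadcast dword 0 of xmm3
 pblendw xmm5, xmm6, 0x0C     ; overwrite words 2-3 (= dword 1) of the copy
 movq    [rdx], xmm5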

pblendw is SSE4.1, but it is of course available on Haswell. Unfortunately it only runs on port 5 on Haswell/Skylake, so it still competes with shuffles.

AVX2 vpblendd runs on any vector-ALU port (p0/p1/p5) on Haswell/Skylake, so is much more efficient than word-granularity pblendw / vpblendw.
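With AVX2 (which the question is avoiding for now), the equivalent dword blend would be something like:

 vpblendd xmm1, xmm1, xmm5, 1 << dst_slot   ; one immediate bit per dword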

If you need to avoid AVX2, consider using SSE4.1 blendps to blend 32-bit elements with an immediate control. It runs on any port on Haswell (or p0/p5 on Sandybridge vs. p1/p5 for shuffles), and the latency penalty for using it on integer data should be irrelevant for your case.
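A sketch of the same pattern with blendps (one immediate bit per dword, like vpblendd above):

 pshufd  xmm5, xmm3, 0x55 * slot
 blendps xmm1, xmm5, 1 << dst_slot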