I have code with a lot of punpckl, pextrd and pinsrd instructions that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image using loop tiling.
I profiled it with IACA to see whether an AVX2 routine would be worth writing, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles; HSW, SKL: 36 cycles). (IVB and HSW were measured with IACA 2.1 and SKL with 3.0, but HSW gives the same number with 3.0.)
From the IACA output, I guess the difference is that IVB uses ports 1 and 5 for these instructions, while Haswell only uses port 5.
I googled a bit but couldn't find an explanation. Is Haswell really slower with legacy SSE, or did I just hit some extreme corner case? Any suggestions to dodge this bullet (other than AVX2, which is a known option but is postponed for now because it requires updating the toolchain to a new version)?
General remarks or suggested improvements are also welcome.
// r8 and r9 are #bytes to go to the next line in resp. src and dest
// r12=3*r8 r13=3*r9
// load 8x8 bytes into 4 registers, bytes interleaved.
movq xmm1,[rcx]
movq xmm4,[rcx+2*r8]
PUNPCKLBW xmm1,xmm4 // 0 2 0 2 0 2
movq xmm7,[rcx+r8]
movq xmm6,[rcx+r12]
PUNPCKLBW xmm7,xmm6 // 1 3 1 3 1 3
movdqa xmm2,xmm1
punpcklbw xmm1,xmm7 // 0 1 2 3 0 1 2 3 in xmm1:xmm2
punpckhbw xmm2,xmm7
lea rcx,[rcx+4*r8]
// same for 4..7
movq xmm3,[rcx]
movq xmm5,[rcx+2*r8]
PUNPCKLBW xmm3,xmm5
movq xmm7,[rcx+r8]
movq xmm8,[rcx+r12]
PUNPCKLBW xmm7,xmm8
movdqa xmm4,xmm3
punpcklbw xmm3,xmm7
punpckhbw xmm4,xmm7
// now we join one "low" dword from XMM1:xmm2 with one "high" dword
// from XMM3:xmm4
movdqa xmm5,xmm1
pextrd eax,xmm3,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm3
pextrd eax,xmm1,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm1
pextrd eax,xmm3,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm3
pextrd eax,xmm1,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
movdqa xmm5,xmm2
pextrd eax,xmm4,0
pinsrd xmm5,eax,1
movq [rdx],xmm5
movdqa xmm5,xmm4
pextrd eax,xmm2,1
pinsrd xmm5,eax,0
movq [rdx+r9],xmm5
movdqa xmm5,xmm2
pextrd eax,xmm4,2
pinsrd xmm5,eax,3
MOVHLPS xmm6,xmm5
movq [rdx+2*r9],xmm6
movdqa xmm5,xmm4
pextrd eax,xmm2,3
pinsrd xmm5,eax,2
MOVHLPS xmm6,xmm5
movq [rdx+r13],xmm6
lea rdx,[rdx+4*r9]
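For readers who want to check what the listing computes, here is a minimal scalar sketch in C of the same data movement. It is a model, not the routine itself: the function name `transpose8x8_ref` and its parameters are hypothetical, and it assumes `src_stride` and `dst_stride` play the role of r8 and r9. As far as can be told from the listing, output row c receives column c of input rows 0..3 followed by column c of input rows 4..7, so the block itself ends up transposed; how that becomes a rotation depends on the pointer and stride setup in the outer loop, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference helper, not part of the original routine.
 * Models the asm: first gather 4-byte column groups (the dwords that
 * end up in xmm1:xmm2 and xmm3:xmm4), then join one "low" and one
 * "high" dword per output row. */
static void transpose8x8_ref(const uint8_t *src, ptrdiff_t src_stride,
                             uint8_t *dst, ptrdiff_t dst_stride)
{
    uint32_t lo[8]; /* lo[c]: column c of rows 0..3 (xmm1:xmm2 dwords) */
    uint32_t hi[8]; /* hi[c]: column c of rows 4..7 (xmm3:xmm4 dwords) */

    for (int c = 0; c < 8; c++) {
        uint8_t a[4], b[4];
        for (int r = 0; r < 4; r++) {
            a[r] = src[r * src_stride + c];
            b[r] = src[(r + 4) * src_stride + c];
        }
        /* memcpy keeps the byte order in memory, like the SSE registers do */
        memcpy(&lo[c], a, 4);
        memcpy(&hi[c], b, 4);
    }

    /* Equivalent of the pextrd/pinsrd (+ movhlps) step and the movq stores */
    for (int c = 0; c < 8; c++) {
        memcpy(dst + c * dst_stride, &lo[c], 4);
        memcpy(dst + c * dst_stride + 4, &hi[c], 4);
    }
}
```

A model like this is mainly useful as a correctness check when trying alternative shuffle sequences: run both versions on the same 8x8 block and compare the 64 output bytes.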
Purpose: this really is about rotating images from a camera for image vision purposes. In some (heavier) apps the rotation is postponed and done for display only (OpenGL); in others it is easier to rotate the input than to adapt the algorithms.
Updated code: I posted some final code here. The speedup was very dependent on the size of the input: large on small images, but still a factor of two on larger ones compared to loop-tiling HLL code with a 32x32 tile (same algorithm as the linked asm code).
punpcklbw, so you can't avoid SSE2 entirely and just use integer shld or something. But note that Haswell has 2 load ports but only 1 shuffle port. – Peter Cordes