This is one of those n00b questions where I'm doing something wrong but I don't fully understand yet.

The xxhash32 algorithm has a nice 16-byte inner loop that can be made faster with SIMD, so, as an exercise, that's what I'm trying to do. The body of the loop looks like this (numBytes is some multiple of 16):
// C# that gets auto-vectorized. uint4 is a vector of 4 elements
uint4 state = new uint4(Prime1 + Prime2, Prime2, 0, (uint)-Prime1) + seed;
int count = numBytes >> 4;
for (int i = 0; i < count; ++i) {
state += *p++ * Prime2;
state = (state << 13) | (state >> 19);
state *= Prime1;
}
hash = rol(state.x, 1) + rol(state.y, 7) + rol(state.z, 12) + rol(state.w, 18);
I've translated this into the following SSE2/SSE4.1 intrinsics:
auto prime1 = _mm_set1_epi32(kPrime1);
auto prime2 = _mm_set1_epi32(kPrime2);
// _mm_setr_epi32 takes lanes in low-to-high order, matching the uint4 (x, y, z, w)
// layout above; _mm_set_epi32 would reverse the lanes relative to the rotate
// counts used in the finalization below.
auto state = _mm_setr_epi32(seed + kPrime1 + kPrime2, seed + kPrime2, seed, seed - kPrime1);
int32_t count = size >> 4; // = size / 16; each iteration consumes 16 bytes
for (int32_t i = 0; i < count; i++) {
state = _mm_add_epi32(state, _mm_mullo_epi32(_mm_loadu_si128(p128++), prime2));
state = _mm_or_si128(_mm_sll_epi32(state, _mm_cvtsi32_si128(13)), _mm_srl_epi32(state, _mm_cvtsi32_si128(19)));
state = _mm_mullo_epi32(state, prime1);
}
uint32_t temp[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(temp), state); // destination pointer first, then the vector
hash = _lrotl(temp[0], 1) + _lrotl(temp[1], 7) + _lrotl(temp[2], 12) + _lrotl(temp[3], 18);
Here's the disassembly of the inner loop body:
mov rax,qword ptr [p128]
mov qword ptr [rsp+88h],rax
mov rax,qword ptr [rsp+88h]
movdqu xmm0,xmmword ptr [rax]
movdqa xmmword ptr [rsp+90h],xmm0
movdqa xmm0,xmmword ptr [rsp+90h]
movdqa xmmword ptr [rsp+120h],xmm0
mov rax,qword ptr [p128]
add rax,10h
mov qword ptr [p128],rax
movdqa xmm0,xmmword ptr [prime2]
movdqa xmmword ptr [rsp+140h],xmm0
movdqa xmm0,xmmword ptr [rsp+120h]
movdqa xmmword ptr [rsp+130h],xmm0
movdqa xmm0,xmmword ptr [rsp+130h]
pmulld xmm0,xmmword ptr [rsp+140h]
movdqa xmmword ptr [rsp+150h],xmm0
movdqa xmm0,xmmword ptr [rsp+150h]
movdqa xmmword ptr [rsp+160h],xmm0
movdqa xmm0,xmmword ptr [rsp+160h]
movdqa xmmword ptr [rsp+170h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+100h],xmm0
movdqa xmm0,xmmword ptr [rsp+100h]
paddd xmm0,xmmword ptr [rsp+170h]
movdqa xmmword ptr [rsp+180h],xmm0
movdqa xmm0,xmmword ptr [rsp+180h]
movdqa xmmword ptr [rsp+190h],xmm0
movdqa xmm0,xmmword ptr [rsp+190h]
movdqa xmmword ptr [rsp+20h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+1A0h],xmm0
mov eax,13h
movd xmm0,eax
movdqa xmmword ptr [rsp+1B0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1A0h]
psrld xmm0,xmmword ptr [rsp+1B0h]
movdqa xmmword ptr [rsp+1C0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1C0h]
movdqa xmmword ptr [rsp+200h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+1D0h],xmm0
mov eax,0Dh
movd xmm0,eax
movdqa xmmword ptr [rsp+1E0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1D0h]
pslld xmm0,xmmword ptr [rsp+1E0h]
movdqa xmmword ptr [rsp+1F0h],xmm0
movdqa xmm0,xmmword ptr [rsp+1F0h]
movdqa xmmword ptr [rsp+210h],xmm0
movdqa xmm0,xmmword ptr [rsp+200h]
movdqa xmmword ptr [rsp+230h],xmm0
movdqa xmm0,xmmword ptr [rsp+210h]
movdqa xmmword ptr [rsp+220h],xmm0
movdqa xmm0,xmmword ptr [rsp+220h]
por xmm0,xmmword ptr [rsp+230h]
movdqa xmmword ptr [rsp+240h],xmm0
movdqa xmm0,xmmword ptr [rsp+240h]
movdqa xmmword ptr [rsp+250h],xmm0
movdqa xmm0,xmmword ptr [rsp+250h]
movdqa xmmword ptr [rsp+20h],xmm0
movdqa xmm0,xmmword ptr [prime1]
movdqa xmmword ptr [rsp+280h],xmm0
movdqa xmm0,xmmword ptr [rsp+20h]
movdqa xmmword ptr [rsp+270h],xmm0
movdqa xmm0,xmmword ptr [rsp+270h]
pmulld xmm0,xmmword ptr [rsp+280h]
movdqa xmmword ptr [rsp+290h],xmm0
movdqa xmm0,xmmword ptr [rsp+290h]
movdqa xmmword ptr [rsp+2A0h],xmm0
movdqa xmm0,xmmword ptr [rsp+2A0h]
movdqa xmmword ptr [rsp+20h],xmm0
Some questions about the disassembly:

- Why so many movdqa instructions? I thought the point of intrinsics was that they map to specific hardware instructions.
- Why is only xmm0 used? It looks like values are being shuffled in and out of the vector pipeline through memory; I'd expect more of the xmmN registers to be used.
This is compiled with Visual C++ 2017; I haven't enabled additional optimizations.

When I run these two snippets over a 64 MiB block, many times over, the scalar code is about 3 times faster. This is not what I expected to happen. What have I missed?
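(For reference, since the build above has no optimizations enabled: an optimized command-line build would look something like the following; the source file name is a placeholder.)

cl /O2 xxh32_simd.cpp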
Comments:

- _mm_sll_epi32 with a vector count of 13 is probably part of the perf problem (once you enable optimization). Use _mm_slli_epi32, which uses the immediate form, pslld xmm, 13. MSVC is very literal about intrinsics and won't do this optimization for you. Extra shift uops are probably competing for some of the same execution ports that _mm_mullo_epi32 needs - it's unfortunately slowish. – Peter Cordes
- Re: _lrotl - maybe only if you had AVX2 for variable-count shifts. Are you sure your loop counts are right, and you're not doing 4x as much work in the SIMD version? Both use size >> 4, but doing 4 elements per vector should mean you only do 1/4 the iterations. – Peter Cordes
- uint4 state = new ... won't compile unless uint4 is actually a typedef for a pointer type. Perhaps that's C#? Probably best to indicate the different language with a comment, if you want to talk about the speed comparison. – Peter Cordes
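Following the first comment, a minimal sketch of the same loop body using the immediate-count shift intrinsics (same variables as the snippet above; _mm_slli_epi32 / _mm_srli_epi32 compile to pslld xmm, 13 / psrld xmm, 19 directly):

for (int32_t i = 0; i < count; i++) {
    // Rotate-left by 13 via immediate-form shifts instead of a count in a register.
    state = _mm_add_epi32(state, _mm_mullo_epi32(_mm_loadu_si128(p128++), prime2));
    state = _mm_or_si128(_mm_slli_epi32(state, 13), _mm_srli_epi32(state, 19));
    state = _mm_mullo_epi32(state, prime1);
}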
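And per the AVX2 remark, the per-lane rotates in the finalization could in principle use the AVX2 variable-count shifts. An untested sketch (the lcount/rcount names are mine; _mm_sllv_epi32 / _mm_srlv_epi32 require AVX2):

// Per-lane rotate counts {1, 7, 12, 18}; right-shift counts are 32 - n.
__m128i lcount = _mm_setr_epi32(1, 7, 12, 18);
__m128i rcount = _mm_setr_epi32(31, 25, 20, 14);
__m128i rot = _mm_or_si128(_mm_sllv_epi32(state, lcount), _mm_srlv_epi32(state, rcount));
// Horizontal sum of the four rotated lanes; lane 0 ends up holding the total.
rot = _mm_add_epi32(rot, _mm_shuffle_epi32(rot, _MM_SHUFFLE(1, 0, 3, 2)));
rot = _mm_add_epi32(rot, _mm_shuffle_epi32(rot, _MM_SHUFFLE(2, 3, 0, 1)));
hash = (uint32_t)_mm_cvtsi128_si32(rot);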