3
votes

I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me.

I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.

template <typename T, typename U>
inline void accumulate_2x2_x_pass(
  T* channel, U* accum,
  const size_t sx, const size_t sy, 
  const size_t osx, const size_t osy,
  const size_t yoff, const size_t oyoff
) {

  const bool odd_x = (sx & 0x01);

  size_t i_idx, o_idx;

  // Should be vectorizable somehow...
  for (size_t x = 0, ox = 0; x < sx - (size_t)odd_x; x += 2, ox++) {
    i_idx = x + yoff;
    o_idx = ox + oyoff;
    accum[o_idx] += channel[i_idx];
    accum[o_idx] += channel[i_idx + 1];
  }

  if (odd_x) {
    // << 1 bc we need to multiply by two on the edge 
    // to avoid darkening during render
    accum[(osx - 1) + oyoff] += (U)(channel[(sx - 1) + yoff]) * 2;
  }
}

However, Godbolt shows that my loop is not auto-vectorized. (https://godbolt.org/z/qZxvof) How would I construct SIMD intrinsics to solve this issue? I have control of the alignment of accum, but not channel.

(I know there's an average intrinsic, but it's not appropriate here because I need to generate multiple mip levels and that command would cause loss of precision for the next level.)

Thanks everyone. :)

1
Looks like a use-case for SSSE3 _mm_hadd_epi32, or _mm_hadd_epi16 if T is int16_t instead of int. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result with shufps. – Peter Cordes
Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t. – SapphireSun
I didn't realize you were widening, that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used -march=native on Godbolt, which is Skylake-AVX512 = AVX512BW.) Anyway, _mm_hadd_* isn't useful when U is not the same width as T. You probably want pmaddwd or pmaddubsw with a multiplier of 1 to add horizontal pairs into a wider result. – Peter Cordes
You should always use -march=haswell or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use -march=corei7, it's kind of meaningless / confusing because it's basically -march=nehalem (the first generation of core i7). – Peter Cordes
On Godbolt you can #include <stddef.h> and use size_t like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it. – Peter Cordes

1 Answer

4
votes

The widening case with a narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions are single-uop, and they do exactly the horizontal widening add you need, more efficiently than shuffling.

If you can do so without losing precision, do the vertical add between rows first, before the widening horizontal add. (e.g. 10-, 12-, or 14-bit pixel components in [u]int16_t can't overflow.) Loads and vertical adds have (at least) 2-per-clock throughput on most CPUs, vs. 1 per clock for pmadd* (it only gained 2-per-clock throughput on Skylake and later). And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add, so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd if you have AVX. For the add-before-pmadd way, you need a pure load first and then fold the 2nd load into add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)
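As a concrete sketch (untested, hypothetical function name), the add-before-pmadd kernel for one pair of rows needs only SSE2:

```cpp
#include <emmintrin.h>  // SSE2

// Add-before-pmadd kernel for one pair of rows (sketch, hypothetical name).
// Safe for pixel components up to 14 bits: the 16-bit vertical sum stays
// non-negative even when pmaddwd reads it as signed.
__m128i sum2x2_16_to_32(__m128i row0, __m128i row1) {
    __m128i vsum = _mm_add_epi16(row0, row1);        // vertical add between rows
    return _mm_madd_epi16(vsum, _mm_set1_epi16(1));  // horizontal pair add -> 4x int32
}
```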

And ideally you don't need to += into an accumulator array; instead you can just read 2 rows in parallel so the accumulator is write-only, and your loop has only 2 input streams and 1 output stream.
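Such a loop might look like this (untested sketch; the name and signature are hypothetical, and it assumes uint16_t -> uint32_t, sx a multiple of 8, and components narrow enough that the vertical 16-bit add can't overflow):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Sketch of a write-only-accumulator loop: reads 2 input rows, writes one
// output row of 2x2 block sums. Assumes sx is a multiple of 8 and pixel
// components are <= 14 bits so the 16-bit vertical add can't overflow.
void sum2x2_row(const uint16_t* row0, const uint16_t* row1,
                uint32_t* accum, size_t sx)
{
    const __m128i ones = _mm_set1_epi16(1);
    for (size_t x = 0; x < sx; x += 8) {
        __m128i a = _mm_loadu_si128((const __m128i*)(row0 + x));
        __m128i b = _mm_loadu_si128((const __m128i*)(row1 + x));
        __m128i vsum = _mm_add_epi16(a, b);         // vertical add between rows
        __m128i hsum = _mm_madd_epi16(vsum, ones);  // widening horizontal pair add
        _mm_storeu_si128((__m128i*)(accum + x/2), hsum);  // write-only output
    }
}
```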

// SSSE3
__m128i hadd_widen8_to_16(__m128i a) {
    // uint8_t, int8_t  (doesn't matter when the multiplier is +1)
    return _mm_maddubs_epi16(a, _mm_set1_epi8(1));
}

// SSE2
__m128i hadd_widen16_to_32(__m128i a) {
    // int16_t, int16_t
    return _mm_madd_epi16(a, _mm_set1_epi16(1));
}

These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.
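For reference, the 256-bit versions could look like this (untested sketch; the `_256` names are hypothetical, and the target attribute is only there so the file compiles on GCC/Clang without -mavx2):

```cpp
#include <immintrin.h>
#include <cstdint>

// 256-bit versions of the widening pair-adds: same intrinsic pattern.
// Each 128-bit lane behaves independently, and because input and output
// vector widths match, no lane-crossing fixup is needed.
__attribute__((target("avx2")))
__m256i hadd_widen8_to_16_256(__m256i a) {
    return _mm256_maddubs_epi16(a, _mm256_set1_epi8(1));
}

__attribute__((target("avx2")))
__m256i hadd_widen16_to_32_256(__m256i a) {
    return _mm256_madd_epi16(a, _mm256_set1_epi16(1));
}

// Runtime self-check helper (only call if the CPU supports AVX2).
__attribute__((target("avx2")))
static bool check_widen16_to_32_256() {
    __m256i v = _mm256_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8,
                                  9, 10, 11, 12, 13, 14, 15, 16);
    int32_t out[8];
    _mm256_storeu_si256((__m256i*)out, hadd_widen16_to_32_256(v));
    for (int i = 0; i < 8; i++)
        if (out[i] != 4 * i + 3) return false;  // pair sums 3, 7, 11, ..., 31
    return true;
}
```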

Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)


The T=U case with uint16_t or uint32_t is a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.

If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2-per-clock shift throughput), even though it costs more: 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector, if there's no front-end bottleneck.

// UNTESTED

//Only any good with AVX, otherwise the extra movdqa instructions kill this
//Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
__m128i hadd32_emulated(__m128i a, __m128i b) {
    __m128i a_shift = _mm_srli_epi64(a, 32);
    __m128i b_shift = _mm_srli_epi64(b, 32);
    a = _mm_add_epi32(a, a_shift);
    b = _mm_add_epi32(b, b_shift);
    __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(combined);
}
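Since every intrinsic in hadd32_emulated is baseline SSE/SSE2, it's easy to sanity-check. A quick usage sketch (the demo wrapper is just for illustration; the function body is repeated so the block is self-contained):

```cpp
#include <emmintrin.h>  // SSE2 (pulls in xmmintrin.h for _mm_shuffle_ps)
#include <cstdint>

// The emulation from above, repeated here so this block is self-contained.
static __m128i hadd32_emulated(__m128i a, __m128i b) {
    __m128i a_shift = _mm_srli_epi64(a, 32);
    __m128i b_shift = _mm_srli_epi64(b, 32);
    a = _mm_add_epi32(a, a_shift);
    b = _mm_add_epi32(b, b_shift);
    __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                                     _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(combined);
}

// Usage: result is {a0+a1, a2+a3, b0+b1, b2+b3}, same layout as phaddd.
static void demo(int32_t out[4]) {
    __m128i a = _mm_setr_epi32(1, 2, 3, 4);
    __m128i b = _mm_setr_epi32(10, 20, 30, 40);
    _mm_storeu_si128((__m128i*)out, hadd32_emulated(a, b));
    // out = {3, 7, 30, 70}
}
```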

For an AVX2 version you'd need a lane-crossing shuffle to fix up a vphadd result. So emulating hadd with shifts might be a bigger win.

// 3x shuffle 1x add uops
__m256i hadd32_avx2(__m256i a, __m256i b) {
    __m256i hadd = _mm256_hadd_epi32(a, b);  // 2x in-lane hadd
    return _mm256_permute4x64_epi64( hadd, _MM_SHUFFLE(3,1,2,0) );
}

// UNTESTED
// 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
__m256i hadd32_emulated_avx2(__m256i a, __m256i b)
{
    __m256i a_shift = _mm256_srli_epi64(a, 32);  // useful result in the low half of each qword
    __m256i b_shift = _mm256_slli_epi64(b, 32);  // ... high half of each qword
    a = _mm256_add_epi32(a, a_shift);
    b = _mm256_add_epi32(b, b_shift);
    __m256i blended = _mm256_blend_epi32(a, b, 0b10101010);  // alternating low/high results
    return _mm256_permutevar8x32_epi32(blended, _mm256_set_epi32(7,5,3,1, 6,4,2,0));
}

On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.

hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.

(https://agner.org/optimize/, and see https://stackoverflow.com/tags/x86/info)