How to extract bytes from an SSE2 __m128i structure?

Question

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).

I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.

I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:

greys = aligned_alloc(16, xres * sizeof(int8_t*));

for (uint32_t x = 0; x < xres; x++)
{
    greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}

(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?

My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.

Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?

I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.

Nevertheless, I'm grateful for any clarification available.

In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).

Peter Cordes Peter Cordes · Accepted Answer · 2016-10-06T00:20:05

Least obvious question first:

once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i

Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).

If you want other parts of the vector, not the low element, your options are:

Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the SSE4.1 similar instructions for other element sizes such as _mm_extract_epi64(v, 1); (PEXTRB/W/D/Q) are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, but smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. print a __m128i variable shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2]
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This is compiles the same as a tmp array, and is generally not easier to read.

An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.

Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available it's likely the compiler will actually MOVHPS and integer reload, which usually isn't optimal. (And some compilers like MSVC don't optimize intrinsics.)

Does this turn into a linear, unbroken block in memory?

No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.

Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).

And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.

assumed _mm_load_si128 only loads signed bytes

Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.

Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.

The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).

See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.

To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.

The x86 tag wiki has some good x86 asm / performance links.

How to extract bytes from an SSE2 __m128i structure?

1 Answers