3
votes

I recently stumbled onto implicit SSE/AVX loads/stores. I thought these were some special extensions of GCC but then realized they work on MSVC as well.

__m128 a = *(__m128*)data;   // same as __m128 a = _mm_load_ps(data)?
__m128 *b = (__m128*)result; // same as _mm_store_ps(result, a)?

What's the proper syntax for these implicit loads/stores?

From what I have read (Addressing a non-integer address, and sse), the implicit loads/stores use aligned operations, so the memory has to be properly aligned. Is it fair to assume they work the same for most compilers (GCC/ICC/MSVC/Clang/MinGW, ...) that support the SSE/AVX intrinsics? What's the motivation for having these implicit loads/stores?
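A minimal sketch of the correspondence, assuming an x86 target with the SSE headers (the wrapper names here are mine, not a standard API):

```cpp
#include <immintrin.h>
#include <cassert>

// Implicit aligned load: the dereference compiles to the same aligned
// load as _mm_load_ps(p); p must be 16-byte aligned.
inline __m128 implicit_load(const float* p) {
    return *(const __m128*)p;
}

// Implicit aligned store: the counterpart of _mm_store_ps(p, v).
inline void implicit_store(float* p, __m128 v) {
    *(__m128*)p = v;
}

// For possibly-unaligned memory, the explicit unaligned intrinsics
// (_mm_loadu_ps / _mm_storeu_ps) are the safe choice instead.
```

GCC, Clang, ICC, and MSVC all accept the dereference; whether a misaligned pointer actually faults depends on which instruction the compiler emits, so it is safest to treat these forms as requiring full alignment.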

My next set of questions is about pushing and popping SSE/AVX registers to the stack. How is this implemented? What if the stack is not 16-byte aligned? Does it then use unaligned loads/stores? As I understand it, the stack is usually 16-byte aligned now, but not necessarily 32-byte aligned (at least in 64-bit mode). If an algorithm has high AVX occupancy and needs to frequently push AVX registers onto the stack, would it make sense to align the stack to 32 bytes (e.g. in GCC with -mpreferred-stack-boundary) for potentially increased performance?
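As a side note, the alignment of any individual stack slot can be forced with alignas, independent of the global stack boundary. A sketch with SSE (on an AVX target, alignas(32) with __m256 works the same way):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// alignas over-aligns this local regardless of the incoming stack
// alignment, so the aligned load/store forms are safe on it.
inline float sum4(const float* src) {      // src itself may be unaligned
    alignas(16) float tmp[4];
    assert((uintptr_t)tmp % 16 == 0);      // guaranteed by alignas(16)
    _mm_store_ps(tmp, _mm_loadu_ps(src));  // aligned store onto tmp is safe
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```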

1
I use them a lot in macros where I could pass in arbitrary pointer types. If it's not aligned, you simply get a misalignment fault. Compilers should already be aligning the stack properly for whatever SIMD it's using. - Mysticial
The code or the comment on the second line is incorrect. _mm_store_ps(result, a) is equivalent to *(__m128*)result = a. The signature of _mm_store_ps is void _mm_store_ps (float* mem_addr, __m128 a), where mem_addr must be aligned to a 16-byte boundary. - plasmacel
@plasmacel, you're right. Although, I personally would never use implicit SSE/AVX loads/stores. - Z boson

1 Answer

2
votes

What you are doing here is reinterpreting memory as if it were filled with __m128 variables. This works because a __m128 is basically 4 floats (or 4 integers, or 2 doubles, ...) written to memory consecutively, so you can treat it as a float array. The only difference is that __m128 must be aligned on 16 bytes, whereas a float array is only guaranteed to be aligned on 4 bytes.
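These layout facts can be checked at compile time (assuming C++11 and the SSE headers):

```cpp
#include <immintrin.h>

static_assert(sizeof(__m128) == 4 * sizeof(float),
              "__m128 occupies four consecutive floats");
static_assert(alignof(__m128) == 16,
              "__m128 must sit on a 16-byte boundary");
static_assert(alignof(float) == 4,
              "a plain float array only guarantees 4-byte alignment");
```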

It is better to use reinterpret_cast for this reinterpretation:

// sqrt calculation: b = sqrt(a)
const int N = 1000; // N % 4 must be 0!
float a[N] __attribute__((aligned(16))); // Input. Force 16-byte alignment.
float b[N] __attribute__((aligned(16))); // Result.

for (int i = 0; i < N; i += 4) {
  __m128 &aVec = reinterpret_cast<__m128&>(a[i]);
  __m128 &bVec = reinterpret_cast<__m128&>(b[i]);
  bVec = _mm_sqrt_ps(aVec);
}
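If the arrays cannot be forced to 16-byte alignment (e.g. the pointers come from elsewhere), the same loop can be written with the explicit unaligned intrinsics instead of reinterpret_cast; a sketch:

```cpp
#include <immintrin.h>

// Unaligned variant of the loop above: valid for any float pointers,
// at the cost of potentially slower unaligned accesses.
void sqrt_array(const float* a, float* b, int n) { // n % 4 must be 0
    for (int i = 0; i < n; i += 4)
        _mm_storeu_ps(b + i, _mm_sqrt_ps(_mm_loadu_ps(a + i)));
}
```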