I recently stumbled onto implicit SSE/AVX loads/stores. I thought these were some special extensions of GCC but then realized they work on MSVC as well.
__m128 a = *(__m128*)data; // same as __m128 a = _mm_load_ps(data)?
__m128 *b = (__m128*)result; // same as _mm_store_ps(result, a)?
What's the proper syntax for these implicit loads/stores?
From what I have read (Addressing a non-integer address, and sse), the implicit loads/stores use aligned loads/stores, so the memory has to be properly aligned. Is it fair to assume they work the same way on most compilers (GCC, ICC, MSVC, Clang, MinGW, ...) that support the SSE/AVX intrinsics? What's the motivation for having these implicit loads/stores?
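To make the comparison concrete, here is a minimal sketch of the two forms side by side. The helper names copy4_deref and copy4_intrin are hypothetical, both pointers are assumed to be 16-byte aligned, and writing the store as an assignment through the cast pointer is my assumption for the pointer-based counterpart of _mm_store_ps.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_load_ps, _mm_store_ps */

/* Hypothetical helper: copy 4 floats using plain pointer dereference.
   Both pointers are assumed to be 16-byte aligned. */
static void copy4_deref(const float *src, float *dst)
{
    assert((uintptr_t)src % 16 == 0 && (uintptr_t)dst % 16 == 0);
    __m128 a = *(const __m128 *)src; /* load by dereferencing the cast pointer */
    *(__m128 *)dst = a;              /* store by assigning through the cast pointer */
}

/* Hypothetical helper: the same copy with the explicit aligned intrinsics. */
static void copy4_intrin(const float *src, float *dst)
{
    assert((uintptr_t)src % 16 == 0 && (uintptr_t)dst % 16 == 0);
    __m128 a = _mm_load_ps(src); /* aligned load */
    _mm_store_ps(dst, a);        /* aligned store */
}
```

Calling both helpers on buffers declared with C11 _Alignas(16) should leave the same four floats in the destination. Whether every compiler treats the two forms identically is exactly what I am asking.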
My next set of questions concerns pushing and popping SSE/AVX registers to and from the stack. How is this implemented? What if the stack is not 16-byte aligned? Does it then use unaligned loads/stores? As I understand it, the stack is usually 16-byte aligned now but not necessarily 32-byte aligned (at least in 64-bit mode). If an algorithm has high AVX occupancy and needs to frequently push AVX registers onto the stack, would it make sense to align the stack to 32 bytes (e.g. in GCC with -mpreferred-stack-boundary) for potentially increased performance?
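To make the stack question concrete, here is a minimal sketch. It assumes C11 _Alignas and that 32 is a supported alignment for automatic variables, which is implementation-defined. The function name is hypothetical. It requests a 32-byte-aligned local as a stand-in for a spilled AVX register and reports whether the address really sits on a 32-byte boundary, whatever the incoming stack alignment was.

```c
#include <assert.h>
#include <stdint.h>

/* 8 floats occupy the same 32 bytes as one 256-bit AVX register. */
static_assert(sizeof(float) * 8 == 32, "8 floats fill one 256-bit register");

/* Hypothetical check: returns 1 if a local buffer requested with 32-byte
   alignment actually lands on a 32-byte boundary, 0 otherwise. */
static int local_is_32_byte_aligned(void)
{
    /* volatile keeps the buffer in memory so it has a real stack address */
    _Alignas(32) volatile float spill[8];
    spill[0] = 1.0f;
    return (uintptr_t)spill % 32 == 0;
}
```

If the compiler honors the request, it has to realign the stack pointer in the function prologue when the incoming stack is only 16-byte aligned. The cost of that realignment is what I would be hoping to avoid with a larger preferred stack boundary.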
_mm_store_ps(result, a) should be equivalent to __m128 *result = (__m128*)a. The signature of _mm_store_ps is void _mm_store_ps (float* mem_addr, __m128 a), where mem_addr must be aligned to 16-byte boundary. - plasmacel