I am getting a seg fault in a loop only when the loop is fully vectorized on an AVX machine (Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz).
Compiled with gcc -c -march=native MyClass.cpp -O3 -ftree-vectorizer-verbose=6
I was experimenting with aligning the arrays such that these messages from -ftree-vectorizer-verbose=6 are avoided:
MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: Alignment of access forced using peeling.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: cost model: prologue peel iters set to vf/2.
MyClass.cpp:352: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
What I wanted to see (and did see) is:
MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
Now, I am not a C/C++/Assembler guru by any stretch, but when I got the seg fault I assumed I had some pointer / array / other screwup in my code and that the fully vectorized loop was just exposing this. But after two days of learning assembler I can't track it down. So here I am.
The code looks like this (hopefully I'm including everything relevant -- I can't share the actual .cpp in its entirety here):
#include <cstdio>

class MyClass {
private:
    static const long maxElems = 1024;
    static const double otherVar = 0.9;  // GCC extension in C++98 mode; C++11 and later need constexpr
    double x[maxElems] __attribute__ ((aligned (32)));  // gcc reports fully vectorized
    //double x[maxElems];                               // leads to unaligned peeling
public:
    void myFunc() {
        // Always works
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);
        // Seg fault if fully vectorized (no peeling)
        for (int i=0; i<maxElems; ++i) {
            x[i] = x[i] - 42;
        }
        // Works if no seg fault earlier
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);
    }
};
When it's fully vectorized I see (using -Wa,-alh flags to see assembler):
989 00
990 0b56 488B4424 movq 40(%rsp), %rax
990 28
991 0b5b C5FD280D vmovapd .LC8(%rip), %ymm1
991 00000000
992 .p2align 4,,10
993 0b63 0F1F4400 .p2align 3
993 00
994 .L153:
995 0b68 C5FD2800 vmovapd (%rax), %ymm0
996 0b6c C5FD5CC1 vsubpd %ymm1, %ymm0, %ymm0
997 0b70 C5FD2900 vmovapd %ymm0, (%rax)
998 0b74 4883C020 addq $32, %rax
999 0b78 4C39E0 cmpq %r12, %rax
1000 0b7b 75EB jne .L153
Again, usual caveat about "not knowing assembler" but I did spend a fair amount of time printing pointers and inspecting assembler to convince myself that this loop starts and ends at the start and end of the array. But the address of the start of x is not divisible by 32 when I get the seg fault. I assume that's what's causing the trouble.
And yes, I do know that I could allocate x on the heap and select where it ends up to get it aligned. But part of my experiment here is to have MyClass be of a fixed size with all the data inside (think: cache efficiency), so I have instances of MyClass allocated on the heap, pointers to them in a collection, and x is inside MyClass.
Isn't that align attribute supposed to put x on a 32-byte boundary? The compiler is assuming that, then the vmovapd is blowing up because it's not, right?
GCC documentation on alignment: https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html
Do I have to align MyClass on the heap somehow instead? How do I do that? How do I tell GCC I did that so it vectorizes like I want?
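On the last two questions, it may help to separate the offset of x inside the object from the address of the object itself. As far as the attribute goes, it can fix the former but cannot move an object the allocator has already placed. A minimal sketch of that arithmetic follows; Demo and xWouldBeAligned are hypothetical names used only for illustration, not part of the real MyClass:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for MyClass; only the layout matters here.
struct Demo {
    long tag;                                     // pushes x away from offset 0
    double x[1024] __attribute__((aligned(32)));  // same attribute as in the question
};

// The attribute guarantees this: the offset of x is a multiple of 32.
static_assert(offsetof(Demo, x) % 32 == 0, "offset of x is 32-byte aligned");

// Returns true if x would land on a 32-byte boundary when a Demo object
// starts at address objectBase. No memory is touched; this is pure arithmetic.
inline bool xWouldBeAligned(std::uintptr_t objectBase) {
    return (objectBase + offsetof(Demo, x)) % 32 == 0;
}
```

For example, a base address of 64 gives an aligned x, while a base of 80 (16-byte aligned only, which is typically all malloc promises on x86-64) does not. So the object itself has to start on a 32-byte boundary for the compiler's assumption to hold.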
EDIT: I have solved this problem (thanks in part to the comments and answers below). It is possible to guarantee alignment of an object when created on the heap by overriding the default new operator. When I did this, I got no seg faults and my code was still perfectly vectorized as I wanted. How I did it:
// VECTOR_ALIGN_BYTES is not shown in this post.
static void* operator new(size_t size) {
    void *alignedPointer;
    int alignError = 0;
    // Try to allocate the required amount of memory (using POSIX standard aligned allocation)
    alignError = posix_memalign(&alignedPointer, VECTOR_ALIGN_BYTES, size);
    // Throw/Report error if any
    if (alignError) {
        throw std::bad_alloc();
    }
    // Return a pointer to this aligned memory location
    return alignedPointer;
}
static void operator delete(void* alignedPointer) {
    // POSIX aligned memory allocation can be freed normally with free()
    free(alignedPointer);
}
C++ calls the constructors/destructor for you just after/before calling the operators. Alignment is thus controlled by the class itself. There are other aligned memory allocators, too, if you have a different preference. I used POSIX.
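To show how the pieces fit together, here is a self-contained sketch of a class that carries its own aligned allocation. AlignedBox and heapInstanceIsAligned are hypothetical names, and the value 32 for VECTOR_ALIGN_BYTES is an assumption chosen to match the aligned attribute:

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Assumed value; the post does not show how VECTOR_ALIGN_BYTES is defined.
static const std::size_t VECTOR_ALIGN_BYTES = 32;

// Hypothetical class standing in for MyClass.
class AlignedBox {
public:
    double x[1024] __attribute__((aligned(32)));

    // Class-specific allocation: every `new AlignedBox` goes through here.
    static void* operator new(std::size_t size) {
        void* p = nullptr;
        // posix_memalign returns 0 on success and an error number otherwise.
        if (posix_memalign(&p, VECTOR_ALIGN_BYTES, size) != 0) {
            throw std::bad_alloc();
        }
        return p;
    }

    // Memory obtained from posix_memalign is released with free().
    static void operator delete(void* p) {
        free(p);
    }
};

// Returns true if a heap-allocated AlignedBox has x on a 32-byte boundary.
inline bool heapInstanceIsAligned() {
    AlignedBox* box = new AlignedBox;
    bool ok = reinterpret_cast<std::uintptr_t>(box->x) % VECTOR_ALIGN_BYTES == 0;
    delete box;
    return ok;
}
```

This sketch only covers single objects. Arrays created with new[] would go through operator new[] instead, which would need the same treatment.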
Two caveats: If someone calls placement new with an arbitrary address, you'll still be unaligned. If someone declares your class as a member of their class, and their class is allocated on the heap, you could be unaligned. I have put a check in my constructor and throw an error if this is detected.
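The constructor check could look roughly like the sketch below. This is not the actual code; requireAligned and CheckedBox are hypothetical names, and std::runtime_error is an assumed choice of exception:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Throws if p is not a multiple of alignment (which must be a power of two).
inline void requireAligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) {
        throw std::runtime_error("object is not aligned as required");
    }
}

// Hypothetical class standing in for MyClass.
class CheckedBox {
public:
    double x[1024] __attribute__((aligned(32)));

    CheckedBox() {
        // Caveat: creating the object at a misaligned address is already
        // undefined behavior, so an optimizer may assume this check passes.
        requireAligned(x, 32);
    }
};
```

Because of the caveat in the comment, a check like this is best treated as a debugging aid rather than a guarantee. It is most reliable in builds where the check has not been optimized away.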
[...] MyClass also. The aligned attribute on a non-static data member can only control the offset, not the absolute placement in memory. – Ben Voigt
[...] __attribute__ ((aligned (32))) on class MyClass. – Ben Voigt
[...] new. Of course you need to align the memory on the heap with something like posix_memalign (I used _mm_malloc because it works with GCC, MinGW, ICC, and MSVC). But for statically allocated and stack allocated data you should only need __attribute__ ((aligned (32))). – Z boson