1
votes

I am getting a seg fault in a loop only when the loop is fully vectorized on an AVX machine (Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz).

Compiled with gcc -c -march=native MyClass.cpp -O3 -ftree-vectorizer-verbose=6

I was experimenting with aligning the arrays such that these messages from -ftree-vectorizer-verbose=6 are avoided:

MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: Alignment of access forced using peeling.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: cost model: prologue peel iters set to vf/2.
MyClass.cpp:352: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .

What I wanted to see (and did see) is:

MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .

Now, I am not a C/C++/Assembler guru by any stretch, but when I got the seg fault I assumed I had some pointer / array / other screwup in my code and that the fully vectorized loop was just exposing this. But after two days of learning assembler I can't track it down. So here I am.

The code looks like this (hopefully I'm including everything relevant -- I can't share the actual .cpp in its entirety here):

class MyClass {

private:
    static const long maxElems = 1024;
    static const double otherVar = 0.9;
    double x[maxElems] __attribute__ ((aligned (32)));  // <-- gcc reports fully vectorized
    //double x[maxElems];   <-- leads to unaligned peeling

public:
    void myFunc() {
        // Always works
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);

        // Seg fault if fully vectorized (no peeling)
        for (int i=0; i<maxElems; ++i) {
            x[i] = x[i] - 42;
        } 

        // Works if no seg fault earlier
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);
    }
};

When it's fully vectorized I see (using -Wa,-alh flags to see assembler):

 989      00
 990 0b56 488B4424      movq    40(%rsp), %rax
 990      28
 991 0b5b C5FD280D      vmovapd .LC8(%rip), %ymm1
 991      00000000 
 992                    .p2align 4,,10
 993 0b63 0F1F4400      .p2align 3
 993      00
 994                .L153:
 995 0b68 C5FD2800      vmovapd (%rax), %ymm0
 996 0b6c C5FD5CC1      vsubpd  %ymm1, %ymm0, %ymm0
 997 0b70 C5FD2900      vmovapd %ymm0, (%rax)
 998 0b74 4883C020      addq    $32, %rax
 999 0b78 4C39E0        cmpq    %r12, %rax
 1000 0b7b 75EB             jne .L153

Again, usual caveat about "not knowing assembler" but I did spend a fair amount of time printing pointers and inspecting assembler to convince myself that this loop starts and ends at the start and end of the array. But the address of the start of x is not divisible by 32 when I get the seg fault. I assume that's what's causing the trouble.

And yes, I do know that I could allocate x on the heap and select where it ends up to get it aligned. But part of my experiment here is to have MyClass be of a fixed size with all the data inside (think: cache efficiency), so I have instances of MyClass allocated on the heap, pointers to them in a collection, and x is inside MyClass.

Isn't that align attribute supposed to put x on a 32-byte boundary? The compiler is assuming that, then the vmovapd is blowing up because it's not, right?

GCC documentation on alignment: https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

Do I have to align MyClass on the heap somehow instead? How do I do that? How do I tell GCC I did that so it vectorizes like I want?

EDIT: I have solved this problem (thanks in part to the comments and answers below). It is possible to guarantee the alignment of an object created on the heap by giving the class its own operator new (and a matching operator delete) in place of the default ones. When I did this, I got no seg faults and my code was still fully vectorized, as I wanted. How I did it:

static void* operator new(size_t size) throw (std::bad_alloc) {
    void *alignedPointer;
    int alignError = 0;

    // Try to allocate the required amount of memory (using POSIX standard aligned allocation)
    alignError = posix_memalign(&alignedPointer, VECTOR_ALIGN_BYTES, size);

    // Throw/Report error if any
    if (alignError) {
        throw std::bad_alloc();
    }

    // Return a pointer to this aligned memory location
    return alignedPointer;
}

static void operator delete(void* alignedPointer) {
    // POSIX aligned memory allocation can be freed normally with free()
    free(alignedPointer);
}

C++ calls the constructors/destructor for you just after/before calling the operators. Alignment is thus controlled by the class itself. There are other aligned memory allocators, too, if you have a different preference. I used POSIX.

Two caveats: If someone calls placement new with an arbitrary address, you'll still be unaligned. If someone declares your class as a member of their class, and their class is allocated on the heap, you could be unaligned. I have put a check in my constructor and throw an error if this is detected.
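
The constructor check mentioned in the second caveat is not shown above, so here is a minimal sketch of what such a check could look like. CheckedClass, isAligned, kVectorAlignBytes and constructInAlignedStorage are hypothetical names, and the value 32 is an assumption based on the AVX register width. The address is read back through a volatile because the compiler is otherwise entitled to assume the attribute holds and fold the test to "always aligned":

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <stdint.h>
#include <stdlib.h>

const std::size_t kVectorAlignBytes = 32;  // assumed: AVX register width in bytes

// True if p sits on a multiple of `alignment` bytes.
inline bool isAligned(const void* p, std::size_t alignment) {
    // Read the address back through a volatile so the optimizer cannot
    // answer from the alignment it already assumes for the type.
    volatile uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    return addr % alignment == 0;
}

class CheckedClass {  // hypothetical stand-in for MyClass
public:
    static const long maxElems = 1024;

    CheckedClass() {
        // Refuse to exist at an address the vectorized loops cannot handle.
        if (!isAligned(x, kVectorAlignBytes)) {
            throw std::runtime_error("CheckedClass: x is not 32-byte aligned");
        }
    }

private:
    double x[maxElems] __attribute__ ((aligned (32)));  // GCC/Clang syntax
};

// Usage sketch: construct one instance in storage from posix_memalign,
// which is what the class-level operator new above hands out.
inline bool constructInAlignedStorage() {
    void* raw = 0;
    if (posix_memalign(&raw, kVectorAlignBytes, sizeof(CheckedClass)) != 0) {
        return false;
    }
    CheckedClass* obj = new (raw) CheckedClass();  // check passes here
    assert(isAligned(obj, kVectorAlignBytes));
    obj->~CheckedClass();
    free(raw);
    return true;
}
```

Constructing the object at a misaligned address is formally undefined behavior, so a check like this is a debugging aid, not a guarantee.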

1
Try aligning MyClass also. The aligned attribute on a non-static data member can only control the offset, not the absolute placement in memory. – Ben Voigt
By "aligning MyClass" I assume you mean that when I create it on the heap, I ensure it is on a 32-byte boundary? Will the align attribute on the data member ensure it's offset to stay aligned? – Corey A. Henderson
No, I mean stick __attribute__ ((aligned (32))) on class MyClass. – Ben Voigt
@BenVoigt, this doesn't work. I put the align attribute in the class declaration and tried again, but there was no change. The object is created on the heap 'wherever' and I still get the seg fault. If the object "just happens" to be allocated on a 32-byte boundary, that instance works fine, but the align attribute does not ensure that. I have not found a way to force alignment entirely within the class if the arrays are members stored inside the object. I am going to try to force the alignment at initialization, but that means a class can't control its own alignment, which I think is a bad thing. – Corey A. Henderson
I missed the part about you using the heap (or free store) with new. Of course you need to align the memory on the heap with something like posix_memalign (I used _mm_malloc because it works with GCC, MinGW, ICC, and MSVC). But for statically allocated and stack allocated data you should only need __attribute__ ((aligned (32))). – Z boson

1 Answer

3
votes
__attribute__((aligned(32)))

may not do what we think it does (bug? Feature?).

It basically tells the compiler it can assume this thing is aligned, which it may not be. If it's on the heap, you need to allocate with posix_memalign or similar.
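
The difference between "assumed" and "guaranteed" can be seen with plain integer arithmetic, without creating a misaligned object at all. The sketch below uses a hypothetical struct Sample and hypothetical base addresses. The attribute makes the member's offset a multiple of 32, but the member's absolute address is still the base address plus that offset, so it is only aligned if the allocator returned an aligned base:

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>

// Hypothetical standard-layout struct with an over-aligned member
// (GCC/Clang attribute syntax, as in the question).
struct Sample {
    long count;
    double x[1024] __attribute__ ((aligned (32)));
};

// Address the member x would have if a Sample started at `base`.
inline uintptr_t memberAddress(uintptr_t base) {
    return base + offsetof(Sample, x);
}

inline void checkSample() {
    // The attribute controls the offset inside the struct...
    assert(offsetof(Sample, x) % 32 == 0);
    // ...so an aligned base gives an aligned member...
    assert(memberAddress(0x1000) % 32 == 0);
    // ...and a base that is only 16-byte aligned does not.
    assert(memberAddress(0x1010) % 32 != 0);
}
```

A general-purpose allocator is only required to return memory aligned for ordinary types, which is why the base address has to come from something like posix_memalign.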

GCC will actually get pointer arithmetic wrong if __attribute__((aligned(...))) is set but the allocation is not aligned.

s2->aligned_var = 0x199c030
&s2->aligned_var % 0x40  = 0x0

https://gcc.gnu.org/ml/gcc/2014-06/msg00308.html