19
votes

I have performance critical code and there is a huge function that allocates like 40 arrays of different size on the stack at the beginning of the function. Most of these arrays have to have certain alignment (because these arrays are accessed somewhere else down the chain using cpu instructions that require memory alignment (for Intel and arm CPUs).

Since some versions of gcc simply fail to align stack variables properly (notably for arm code), or even sometimes it says that maximum alignment for the target architecture is less than what my code actually requests, I simply have no choice but to allocate these arrays on the stack and align them manually.

So, for each array I need to do something like that to get it aligned properly:

short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));

This way, history is now aligned on 32-byte boundary. Doing the same is tedious for all 40 arrays, plus this part of code is really cpu intensive and I simply cannot do the same alignment technique for each of the arrays (this alignment mess confuses the optimizer and different register allocation slows down the function big time, for better explanation see explanation at the end of the question).

So... obviously, I want to do that manual alignment only once and assume that these arrays are located one right after the other. I also added extra padding to these arrays so that they are always multiple of 32 bytes. So, then I simply create a jumbo char array on the stack and cast it to a struct that has all these aligned arrays:

struct tmp
{
   short history[HIST_SIZE];
   short history2[2*HIST_SIZE];
   ...
   int energy[320];
   ...
};


char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));

Something like that. Maybe not the most elegant, but it produced really good result and manual inspection of generated assembly proves that generated code is more or less adequate and acceptable. Build system was updated to use newer GCC and suddenly we started to have some artifacts in generated data (e.g. output from validation test suite is not bit exact anymore even in pure C build with disabled asm code). It took long time to debug the issue and it appeared to be related to aliasing rules and newer versions of GCC.

So, how can I get it done? Please, don't waste time trying to explain that it's not standard, not portable, undefined etc (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC as well to fix the issue, but not refactoring the code)... basically, all I want is to apply some black magic spell so that newer GCC produces the functionally same code for this type of code without disabling optimizations?

Edit:

  • I used this code on multiple OSes/compilers, but started to have issues when I switched to newer NDK which is based on GCC 4.6. I get the same bad result with GCC 4.7 (from NDK r8d)
  • I mention 32 byte alignment. If it hurts your eyes, substitute it with any other number that you like, for example 666 if it helps. There is absolutely no point to even mention that most architectures don't need that alignment. If I align 8KB of local arrays on stack, I loose 15 bytes for 16 byte alignment and I loose 31 for 32 byte alignment. I hope it's clear what I mean.
  • I say that there are like 40 arrays on the stack in performance critical code. I probably also need to say that it's a third party old code that has been working well and I don't want to mess with it. No need to say if it's good or bad, no point for that.
  • This code/function has well tested and defined behavior. We have exact numbers of the requirements of that code e.g. it allocates Xkb or RAM, uses Y kb of static tables, and consumes up to Z kb of stack space and it cannot change, since the code won't be changed.
  • By saying that "alignment mess confuses the optimizer" I mean that if I try to align each array separately code optimizer allocates extra registers for the alignment code and performance critical parts of code suddenly don't have enough registers and start trashing to stack instead which results in a slowdown of the code. This behavior was observed on ARM CPUs (I'm not worried about intel at all by the way).
  • By artifacts I meant that the output becomes non-bitexact, there is some noise added. Either because of this type aliasing issue or there is some bug in the compiler that results eventually in wrong output from the function.

    In short, the point of the question... how can I allocate random amount of stack space (using char arrays or alloca, and then align pointer to that stack space and reinterpret this chunk of memory as some structure that has some well defined layout that guarantees alignment of certain variables as long as the structure itself is aligned properly. I'm trying to cast the memory using all kinds of approaches, I move the big stack allocation to a separate function, still I get bad output and stack corruption, I'm really starting to think more and more that this huge function hits some kind of bug in gcc. It's quite strange, that by doing this cast I can't get this thing done no matter what I try. By the way, I disabled all optimizations that require any alignment, it's pure C-style code now, still I get bad results (non-bitexact output and occasional stack corruptions crashes). The simple fix that fixes it all, I write instead of:

    char buf[sizeof(tmp) + 32];
    tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
    

    this code:

    tmp buf;
    tmp * X = &buf;
    

    then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.

    Interesting observation:
    I mentioned that this approach works well and produces expected output:

    tmp buf;
    tmp * X = &buf;
    

    In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:

    struct tmp * to_struct_tmp(void * buffer32)
    {
        return (struct tmp *)buffer32;
    }
    

    Initially, I thought that if I cast alloc'ed memory using to_struct_tmp it will trick gcc to produce results that I expected to get, yet, it still produces invalid output. If I try to modify working code this way:

    tmp buf;
    tmp * X = to_struct_tmp(&buf);
    

    then i get the same bad result! WOW, what else can I say? Perhaps, based on strict-aliasing rule gcc assumes that tmp * X isn't related to tmp buf and removed tmp buf as unused variable right after return from to_struct_tmp? Or does something strange that produces unexpected result. I also tried to inspect generated assembly, however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so, somehow that aliasing rule affects code generation big time.

    Conclusion:
    After all kinds of testing, I have an idea why possibly I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack for it. Then, local variables that also use stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares the same stack memory as other variables of the function. Only this could explain why it always results in the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.

  • 4
    You can disable strict aliasing rules by passing the -fno-strict-aliasing switch to gcc. Maybe that's an option for you?Praetorian
    @JonathonReinhart: Possibly for reasons of TLDRLightness Races in Orbit
    I would say that it should be closed as NARQ because you're clearly having a problem with a specific compiler in specific circumstances, but you don't say what compiler version you're using, nor do you give a code example that will fail to work on that compiler version. It's a very low-level, compiler-specific problem; that means we need to know exactly what compiler is having the problem, on what architecture, and we need to see exactly what code generates the problematic code. Of course, doing that makes the question too localized.Nicol Bolas
    @Pavel: "somebody else will have the same issue and it will be helpful for them." What same issue? I don't understand what the issue is because you haven't provided enough information to actually understand what code you have to write to make it happen, on what compiler versions it happens on, and what compiled architectures exhibit the problem. I'm not accusing the question of being bad; I'm accusing it of being incomplete. That's one of the definitions of NARQ. Show me a fully-compilable test example that exhibits the problem.Nicol Bolas

    4 Answers

    3
    votes

    Just disable alias-based optimization and call it a day

    If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.

    Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing would solve the problem, not cost much, and they all must have had other fish to fry.)

    5
    votes

    First I'd like to say I'm definitely with you when you ask not to buzz about "standard violation", "implementation-dependent" and etc. Your question is absolutely legitimate IMHO.

    Your approach to pack all the arrays within one struct also makes sense, that's what I'd do.

    It's unclear from the question formulation which "artifacts" do you observe. Is there any unneeded code generated? Or data misalignment? If the latter is the case - you may (hopefully) use things like STATIC_ASSERT to ensure at compile-time that things are aligned properly. Or at least have some run-time ASSERT at debug build.

    As Eric Postpischil proposed, you may consider declaring this structure as global (if this is applicable for the case, I mean multi-threading and recursion are not an option).

    One more point that I'd like to notice is so-called stack probes. When you allocate a lot of memory from the stack in a single function (more than 1 page to be exact) - on some platforms (such as Win32) the compiler adds an extra initialization code, known as stack probes. This may also have some performance impact (though likely to be minor).

    Also, if you don't need all the 40 arrays simultaneously you may arrange some of them in a union. That is, you'll have one big struct, inside which some sub-structs will be grouped into union.

    4
    votes

    There are a number of issues here.

    Alignment: There is little that requires 32-byte alignment. 16-byte alignment is beneficial for SIMD types on current Intel and ARM processors. With AVX on current Intel processors, the performance cost of using addresses that are 16-byte aligned but not 32-byte aligned is generally mild. There may be a large penalty for 32-byte stores that cross a cache line, so 32-byte alignment can be helpful there. Otherwise, 16-byte alignment may be fine. (On OS X and iOS, malloc returns 16-byte aligned memory.)

    Allocation in critical code: You should avoid allocating memory in performance critical code. Generally, memory should be allocated at the start of the program, or before performance critical work begins, and reused during the performance critical code. If you allocate memory before performance critical code begins, then the time it takes to allocate and prepare the memory is essentially irrelevant.

    Large, numerous arrays on the stack: The stack is not intended for large memory allocations, and there are limits to its use. Even if you are not encountering problems now, apparently unrelated changes in your code in the future could interact with using a lot of memory on the stack and cause stack overflows.

    Numerous arrays: 40 arrays is a lot. Unless these are all in use for different data at the same time, and necessarily so, you should seek to reuse some of the same space for different data and purposes. Using different arrays unnecessarily can cause more cache thrashing than necessary.

    Optimization: It is not clear what you mean by saying that the “alignment mess confuses the optimizer and different register allocation slows down the function big time”. If you have multiple automatic arrays inside a function, I would generally expect the optimizer to know they are different, even if you derive pointers from the arrays by address arithmetic. E.g., given code such as a[i] = 3; b[i] = c[i]; a[i] = 4;, I would expect the optimizer to know that a, b, and c are different arrays, and therefore c[i] cannot be the same as a[i], so it is okay to eliminate a[i] = 3;. Perhaps an issue you have is that, with 40 arrays, you have 40 pointers to arrays, so the compiler ends up moving pointers into and out of registers?

    In that case, reusing fewer arrays for multiple purposes might help reduce that. If you have an algorithm that is actually using 40 arrays at one time, then you might look at restructuring the algorithm so it uses fewer arrays at a time. If an algorithm has to point to 40 different places in memory, then you essentially need 40 pointers, regardless of where or how they are allocated, and 40 pointers is more than there are registers available.

    If you have other concerns about optimization and register use, you should be more specific about them.

    Aliasing and artifacts: You report there are some aliasing and artifact problems, but you do not give sufficient details to understand them. If you have one large char array that you reinterpret as a struct containing all your arrays, then there is no aliasing within the struct. So it is not clear what issues you are encountering.

    2
    votes

    32 byte alignment sounds as if you are pushing the button too far. No CPU instruction should require an alignement as large as that. Basically an alignement as wide as the largest data type of your architecture should suffice.

    C11 has the concept fo maxalign_t, which is a dummy type of maximum alignment for the architecture. If your compiler doesn't have it, yet, you can easily simulate it by something like

    union maxalign0 {
      long double a;
      long long b;
      ... perhaps a 128 integer type here ...
    };
    
    typedef union maxalign1 maxalign1;
    union maxalign1 {
      unsigned char bytes[sizeof(union maxalign0)];
      union maxalign0;
    }
    

    Now you have a data type that has the maximal alignment of your platform and that is default initialized with all bytes set to 0.

    maxalign1 history_[someSize];
    short * history = history_.bytes;
    

    This avoids the awful address computations that you do currently, you'd only have to do some adoption of someSize to take into account that you always allocate multiples of sizeof(maxalign1).

    Also be asured that this has no aliasing problems. First of all unions in C made for this, and then character pointers (of any version) are always allowed to alias any other pointer.