9
votes

How do I convince GCC to unroll a loop where the number of iterations is known, but large?

I'm compiling with -O3.

The real code in question is more complex, of course, but here's a boiled-down example that has the same behavior:

int const constants[] = { 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144 };

int get_sum_1()
{
    int total = 0;
    for (int i = 0; i < CONSTANT_COUNT; ++i)
    {
        total += constants[i];
    }
    return total;
}

...if CONSTANT_COUNT is defined as 8 (or less) then GCC will unroll the loop, propagate the constants, and reduce the entire function down to a simple return <value>;. If, on the other hand, CONSTANT_COUNT is 9 (or greater) then the loop is not unrolled, and GCC produces a binary which loops, reads the constants, and adds them at run-time - even though, in theory, the function could still be optimized down to just returning a constant. (Yes, I've looked at the decompiled the binary.)

If I manually unroll the loop, like this:

int get_sum_2()
{
    int total = 0;
    total += constants[0];
    total += constants[1];
    total += constants[2];
    total += constants[3];
    total += constants[4];
    total += constants[5];
    total += constants[6];
    total += constants[7];
    total += constants[8];
    //total += constants[9];
    return total;
}

Or this:

#define ADD_CONSTANT(z, v, c) total += constants[v];

int get_sum_2()
{
    int total = 0;
    BOOST_PP_REPEAT(CONSTANT_COUNT, ADD_CONSTANT, _)
    return total;
}

...then the function is optimized down to returning a constant. So, GCC appears to be able to handle the constant propagation for larger loops, once unrolled; the hang-up appears to be just getting GCC to consider unrolling the longer loop in the first place.

However, neither hand-unrolling nor BOOST_PP_REPEAT are viable options, because there are some cases where CONSTANT_COUNT is a run-time expression, and the same code still needs to work correctly for those cases. (Performance isn't as critical in those cases.)

I'm working in C (not C++) so neither template meta-programming nor constexpr are available to me.

I've tried -funroll-loops, -funroll-all-loops, -fpeel-loops, and setting large values for max-unrolled-insns, max-average-unrolled-insns, max-unroll-times, max-peeled-insns, max-peel-times, max-completely-peeled-insns, and max-completely-peel-times, none of which seem to make a difference.

I'm using GCC 4.8.2, on Linux, x86_64.

Any ideas? Is there a flag or parameter I'm missing...?

2
So, you basically want GCC to figure out what the loop would compute, and just use that number; you don't really care if it mentally unrolls the loop or not, as long as it boils it down to a constant, right?SamB

2 Answers

3
votes

I'm not sure if this workaround is applicable to your actual problem but I found that GCC 4.9.0 20140604 (prerelease) on an x86_64 running Parabola GNU/Linux unrolls the following loop up to and including CONSTANT_COUNT == 33.

int
get_sum()
{
  int total = 0;
  int i, j, k = 0;
  for (j = 0; j < 2; ++j)
    {
      for (i = 0; i < CONSTANT_COUNT / 2; ++i)
        {
          total += constants[k++];
        }
    }
  if (CONSTANT_COUNT % 2)
    total += constants[k];
  return total;
}

I have only passed it the -O3 flag. The assembly code for get_sum is really just

movl $561, %eax
ret

I did not try but maybe the pattern can be extended even further.

It seems odd to me that this works since – at least to my human eyes – the code looks much more complex now. Unfortunately, it is a rather intrusive way to force the unrolling. A compiler flag would have been much nicer.

1
votes

GCC has a big lot of obscure parameters and program arguments regarding loop unrolling (and optimizations). You could play with -funroll-loops, -funroll-all-loops, --param name=value, (e.g. with name being max-unroll-times ....) etc.

Order of arguments to gcc matters a lot. You probably want to put -O3 first, and the weird options above after it.

However, increasing the unrolling will not always improve the performance.

Last but not least, you could code your own GCC plugin which would change the unrolling criteria.

Cleverly using __builtin_prefetch might improve performance, see this answer (but using it carelessly will degrade performance)

You need to benchmark. My feeling is that premature micro-optimization is a big lose of your time.