Loop unrolling in inlined functions in C

Question

I have a question about C compiler optimization and when/how loops in inline functions are unrolled.

I am developing a numerical code which does something like the example below. Basically, my_for() would compute some kind of stencil and call op() to do something with the data in my_type *arg for each i. Here, my_func() wraps my_for(), creating the argument and passing it a function pointer to my_op(), whose job it is to modify the ith double in each of the (arg->n) double arrays arg->dest[j].

typedef struct my_type {
  int const n;
  double *dest[16];
  double const *src[16];
} my_type;

static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N ) {
  int i;

  for( i=0; i<N; ++i )
    op( arg, i );
}

static inline void my_op( my_type *arg, int i ) {
  int j;
  int const n = arg->n;

  for( j=0; j<n; ++j )
    arg->dest[j][i] += arg->src[j][i];
}

void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N ) {
  my_type Arg = {
    .n = 2,
    .dest = { dest0, dest1 },
    .src = { src0, src1 }
  };

  my_for( &my_op, &Arg, N );
}

This works fine. The functions are being inlined as they should, and the code is (almost) as efficient as having written everything inline in a single function and unrolled the j loop, without any sort of my_type Arg.

Here’s the confusion: if I set int const n = 2; rather than int const n = arg->n; in my_op(), then the code becomes as fast as the unrolled single-function version. So, the question is: why? If everything is being inlined into my_func(), why doesn’t the compiler see that I am literally defining Arg.n = 2? Furthermore, there is no improvement when I explicitly make the bound on the j loop arg->n, which should look just like the speedier int const n = 2; after inlining. I also tried using my_type const everywhere to really signal this const-ness to the compiler, but it just doesn't want to unroll the loop.

In my numerical code, this amounts to about a 15% performance hit. If it matters, there, n=4 and these j loops appear in a couple of conditional branches in an op().

I am compiling with icc (ICC) 12.1.5 20120612. I tried #pragma unroll. Here are my compiler options (did I miss any good ones?):

-O3 -ipo -static -unroll-aggressive -fp-model precise -fp-model source -openmp -std=gnu99 -Wall -Wextra -Wno-unused -Winline -pedantic

Thanks!

How "far away" to look for values that are known at compile-time when inlining is a difficult decision. It looks like you ran into the compiler's limit. Passing n as an explicit function parameter might improve the odds. — molbdnilo
I wonder if you would gain much more speed by swapping the dimensions. As given now, you possibly take little advantage of cache lines and burst fills (and you could use memcpy, which is highly optimized already). Also, filling the struct with an initializer is a gcc extension (hope you are aware of this - not a problem for me). — too honest for this site
@Olaf There are quite a few additional calculations and conditionals in the real simulation code -- all of which happen to be independent of j and don't need to be recalculated. While N~1024^3, this actually gives an order of magnitude speed up over the reverse. Thanks about the struct initializer info, I did not know that. Luckily icc doesn't seem to mind... — FiniteElement
How are you determining that the inline functions are being inlined by the compiler? By the way I thought this article interesting though a bit old, Dr. Dobbs - The New C: Inline Functions describing some of the compiler actions. — Richard Chambers

egur · Accepted Answer · 2015-06-12T12:05:08

Well, obviously the compiler isn't 'smart' enough to propagate the constant n and unroll the for loop. Actually it plays it safe, since as far as it can tell arg->n could change between where Arg is initialized and where it is used.

In order to have consistent performance across compiler generations and squeeze the maximum out of your code, do the unrolling by hand.
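
As a rough sketch of what "by hand" could look like for the n = 2 case in the question (the name my_func_unrolled is made up for illustration, and this assumes the per-element work really is just the += from my_op()):

```c
/* Hand-unrolled variant of my_func() for the n == 2 case in the question.
   The name my_func_unrolled is hypothetical. */
void my_func_unrolled( double *dest0, double *dest1,
                       double const *src0, double const *src1, int N ) {
  int i;

  for( i=0; i<N; ++i ) {
    /* The j loop from my_op() written out explicitly: j = 0, then j = 1. */
    dest0[i] += src0[i];
    dest1[i] += src1[i];
  }
}
```

There is no my_type and no loop over j left for the compiler to reason about, at the cost of having to write a separate version for each n.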

What people like myself do in these situations (performance is king) is rely on macros.

Macros will 'inline' even in debug builds (useful) and can be templated (to a point) using macro parameters. Macro parameters which are compile-time constants are guaranteed to remain so.
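
A minimal sketch of the macro idea, assuming the same += update as in the question. DEFINE_MY_FUNC, my_func_n2 and my_func_n4 are hypothetical names, not anything from the original code. The macro stamps out one function per value of n, so the bound on the j loop is a literal in each copy:

```c
/* Hypothetical macro: generates a function NAME whose inner loop bound is
   the literal NVAL, so it is a compile-time constant in every instance. */
#define DEFINE_MY_FUNC( NAME, NVAL )                                    \
  void NAME( double *const dest[], double const *const src[], int N ) { \
    int i, j;                                                           \
                                                                        \
    for( i=0; i<N; ++i )                                                \
      for( j=0; j<(NVAL); ++j )                                         \
        dest[j][i] += src[j][i];                                        \
  }

/* One instance per value of n that is needed. */
DEFINE_MY_FUNC( my_func_n2, 2 )
DEFINE_MY_FUNC( my_func_n4, 4 )
```

Whether a particular compiler then actually unrolls the j loop is still its own decision, but the trip count no longer has to be propagated through a struct and a function pointer.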
