4
votes

I am trying to unroll this loop by a factor of 2.

for(i=0; i<100; i++){
  x[i] = y[i] + z[i];
  z[i] = y[i] + a[i];
  z[i+1] = y[i] * a[i];
}

I have it unrolled to:

 for(i=0; i<100; i+=2){
   x[i] = y[i] + z[i];
   x[i+1] = y[i+1] + z[i+1];
   z[i] = y[i] + a[i];
   z[i+1] = y[i] * a[i];
 }

I'm not sure how to unroll the lines for z[i] as the original for loop already has z[i+1]. Can anyone help me understand this?

2
Do not focus on "Micro-Optimizations" -- that's the compiler's job. The only time you will want to second-guess the compiler is if you have an actual problem, delay or "hot spot" in the program execution. Then you can profile your code to find out exactly where the problem is and look at further optimization (which usually isn't a code optimization issue at all, but a code logic problem -- one the compiler can't help with...). – David C. Rankin
The correct way to optimize this code manually would likely be to come up with something like typedef struct {int x[100]; int y[100]; int z[100]; int a[100];} xyza;. Now we can split up the different calculations based on where the arrays are stored in memory relative to each other, then investigate cache use, potential for parallelization, etc. – Lundin
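A minimal sketch of the layout Lundin's comment suggests. Note one adjustment not in the comment: since the loop writes z[i+1] up to z[100], z is sized at 101 elements here to avoid overrunning the comment's z[100]:

```c
/* Sketch of the struct-of-arrays layout from Lundin's comment: putting
   the four arrays in one struct fixes their relative placement in
   memory, which makes cache behaviour easier to reason about. */
typedef struct {
    int x[100];
    int y[100];
    int z[101];   /* 101, not 100: the loop writes z[i+1] up to z[100] */
    int a[100];
} xyza;

void compute(xyza *p) {
    for (int i = 0; i < 100; i++) {
        p->x[i] = p->y[i] + p->z[i];
        p->z[i] = p->y[i] + p->a[i];
        p->z[i + 1] = p->y[i] * p->a[i];
    }
}
```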

2 Answers

5
votes

I'd say: simply add the lines for i+1, but you have to be sure they run in the right order, so:

 for(i=0; i<100; i+=2){
    x[i] = y[i] + z[i];
    z[i] = y[i] + a[i];
    z[i+1] = y[i] * a[i];

    // next iteration
    if (i+1 < 100) {
          x[i+1] = y[i+1] + z[i+1];
          z[i+1] = y[i+1] + a[i+1];
          z[i+2] = y[i+1] * a[i+1]; 
    }
 }

EDIT

To make it safe for all upper bounds (not only even ones), you have to add an if inside the loop.

As Adrian Mole mentions, it could be better to check the upper bound in the first place, or to choose the array size conveniently.
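For an arbitrary count n (which may be odd), another common shape is to unroll only while a full pair of iterations remains and handle any leftover iteration in a scalar epilogue after the loop. A sketch with hypothetical function names, including the original loop as a reference to compare against (z needs n+1 elements because the body writes z[i+1]):

```c
/* Original loop body, kept as a reference for comparison. */
static void reference(int n, int *x, int *y, int *z, int *a) {
    for (int i = 0; i < n; i++) {
        x[i] = y[i] + z[i];
        z[i] = y[i] + a[i];
        z[i + 1] = y[i] * a[i];
    }
}

/* Unrolled by 2, with a scalar epilogue so any n works, not only even. */
static void unrolled2(int n, int *x, int *y, int *z, int *a) {
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        /* iteration i */
        x[i] = y[i] + z[i];
        z[i] = y[i] + a[i];
        z[i + 1] = y[i] * a[i];
        /* iteration i+1, in the same order as sequential execution */
        x[i + 1] = y[i + 1] + z[i + 1];
        z[i + 1] = y[i + 1] + a[i + 1];
        z[i + 2] = y[i + 1] * a[i + 1];
    }
    for (; i < n; i++) {   /* epilogue: at most one leftover iteration */
        x[i] = y[i] + z[i];
        z[i] = y[i] + a[i];
        z[i + 1] = y[i] * a[i];
    }
}
```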

3
votes

I am trying to unroll this loop by a factor of 2.

You shouldn't be, because that won't help much. Compilers will do the unrolling and optimization for free if you just let them. The loop you've got doesn't benefit from unrolling, and compilers are clever enough to determine that - so they don't do any unrolling either. The loop as written prevents the compilers from doing their job. Let's fix it.

First, we'd like something that actually compiles. It doesn't matter whether the elements are integers or floats or doubles - the compiler will do a very good job for all common types.

We'll be compiling it on godbolt using gcc x86-64 10.2 with the -O3 option.

Let's begin:

typedef int elt;

We must assume something. I'll make a reasonable assumption that the a, x, y and z arrays don't overlap. This is very important - if that's not the case, it must be stated so in the question (!!). The restrict keyword codifies this fact:

void test1(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    for (int i=0; i<100; i++) {
        x[i] = y[i] + z[i];
        z[i] = y[i] + a[i];
        z[i+1] = y[i] * a[i];
    }
}

If you feed this function one or more arrays that overlap, you'll get undefined behavior and most likely incorrect results. But the optimization has to go on some assumptions - even your originally planned manual unrolling would be generally impossible otherwise.
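To see concretely why the no-overlap assumption matters, here is a small demonstration (using a copy of the body without restrict, so the aliased call is legal): running the same loop once with distinct y and z buffers and once with y and z aliased to one buffer produces different results. That is exactly why the compiler cannot reorder or vectorize freely unless we promise the arrays are distinct:

```c
/* Same body as test1, but without restrict so aliasing is allowed. */
static void body_norestrict(int *a, int *x, int *y, int *z) {
    for (int i = 0; i < 100; i++) {
        x[i] = y[i] + z[i];
        z[i] = y[i] + a[i];
        z[i + 1] = y[i] * a[i];
    }
}
```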

If we just change compiler options from -O3 to -O3 -funroll-loops, we'll get this code nicely unrolled, by a factor of more than 2. We force the compiler's hand, and it obliges, whether it makes sense or not. So you got what you wanted, case closed, let's go home? Ah, no, that would be no fun at all - that's selling ourselves very short :)

Without forcing its hand, the compiler "only" generates code that steps this loop by i+=1.

Now observe that z[i+1] is not actually needed: all but the very last value of it is overwritten. All it does is feed the output of the previous iteration to the input of the next.

We can rewrite the function without that forwarding store:

void test2(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    elt fwd = z[0];
    for (int i=0; i<100; i++){
        x[i] = y[i] + fwd;
        z[i] = y[i] + a[i];
        fwd = y[i] * a[i];
    }
    z[100] = fwd;
}

The compiler already generates one less store per iteration, but still iterates only by i+=1.

The forwarding operation is bad news for optimizers. Some high-end CPUs have clever enough data dependency tracking and long enough pipelines to work around it to an extent - but we're not talking typical consumer systems here (not yet, anyway).

To equalize the chances for everyone, each loop iteration should be independent: its outputs should not feed to any other inputs. Instead of computing the future value of fwd, we can compute it right when it's needed:

void test3(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    x[0] = y[0] + z[0];
    z[0] = y[0] + a[0];
    for (int i=1; i<100; i++){
        x[i] = y[i] + y[i-1] * a[i-1];
        z[i] = y[i] + a[i];
    }
    z[100] = y[99] * a[99];
}

This is very good news: the compiler can easily prove that the outputs of each loop iteration are independent. It not only unrolls the loop by a factor of 4, i.e. iterating by i+=4, but also completely vectorizes the inside of the loop, so that the instructions inside the loop, taken together, cost approximately as much as a single iteration of the non-vectorized loop (give or take) - while doing 4x as much work!
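As a sanity check that the rewrite preserves the original semantics, a small harness can compare the first and last versions on the same inputs (the bodies of test1 and test3 are reproduced below under the hypothetical names t1 and t3; z needs 101 elements in both):

```c
typedef int elt;

/* Body of test1: the faithful transcription of the original loop. */
static void t1(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    for (int i = 0; i < 100; i++) {
        x[i] = y[i] + z[i];
        z[i] = y[i] + a[i];
        z[i + 1] = y[i] * a[i];
    }
}

/* Body of test3: forwarding removed, iterations fully independent. */
static void t3(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    x[0] = y[0] + z[0];
    z[0] = y[0] + a[0];
    for (int i = 1; i < 100; i++) {
        x[i] = y[i] + y[i - 1] * a[i - 1];
        z[i] = y[i] + a[i];
    }
    z[100] = y[99] * a[99];
}
```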

Notice that this is achieved with nothing but writing the C code correctly. We have done no micro-optimizations, just removed forwarding across loop iterations - such forwarding must be considered a pessimization in today's world. Getting rid of it empowers the compiler to read our mind :)

Now, what if we let the compiler generate code for a more modern architecture - say, Skylake? gcc -O3 -march=skylake-avx512 unrolls the loop completely. Moreover, the loop takes less than one instruction per iteration. You read that right: the internal section of the loop comprises fewer than 100 machine instructions - only the setup/teardown code makes it go a bit over 100.

At this point it's probably not worth doing much more - the performance is quite satisfactory, I'd say.

But if you really wanted to unroll the loop by hand (say, if you're doing this for some school assignment, and not because you care about actual performance, as that won't change for the better) - then now is the time to do it, because the form that lets the compiler produce good code also makes manual unrolling trivial:

void test4(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    x[0] = y[0] + z[0];
    z[0] = y[0] + a[0];
    x[1] = y[1] + y[0] * a[0];
    z[1] = y[1] + a[1];
    for (int i=2; i<100; i+=2){
        x[i] = y[i] + y[i-1] * a[i-1];
        z[i] = y[i] + a[i];
        x[i+1] = y[i+1] + y[i] * a[i];
        z[i+1] = y[i+1] + a[i+1];
    }
    z[100] = y[99] * a[99];
}

It's just as easy to unroll it 4x, etc. But with any good compiler and an optimized build this version will be no better than the non-unrolled one, and if you let the compiler target a better architecture than "baseline" x86-64, any manual unrolling is pointless, since the compiler unrolls it completely into fewer than 110 machine instructions for Skylake.
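Manual unrolling like this is easy to get subtly wrong at the boundaries, so it is worth checking against the plain loop. A minimal harness, reproducing the original loop and test4's body under the hypothetical names plain and t4:

```c
typedef int elt;

/* The original loop, as a reference. */
static void plain(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    for (int i = 0; i < 100; i++) {
        x[i] = y[i] + z[i];
        z[i] = y[i] + a[i];
        z[i + 1] = y[i] * a[i];
    }
}

/* Body of test4: forwarding removed, then unrolled by hand x2. */
static void t4(elt *restrict a, elt *restrict x, elt *restrict y, elt *restrict z) {
    x[0] = y[0] + z[0];
    z[0] = y[0] + a[0];
    x[1] = y[1] + y[0] * a[0];
    z[1] = y[1] + a[1];
    for (int i = 2; i < 100; i += 2) {
        x[i] = y[i] + y[i - 1] * a[i - 1];
        z[i] = y[i] + a[i];
        x[i + 1] = y[i + 1] + y[i] * a[i];
        z[i + 1] = y[i + 1] + a[i + 1];
    }
    z[100] = y[99] * a[99];
}
```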

But wait a minute. That's for the int data type. And, well, it turns out that int kinda sucks. Yeah, schlepping ints around isn't what fast code does these days - it works on floating point. Image and video decoding and audio processing all need lots of floating-point computation these days, so CPU designers put a lot of work into making that efficient.

For float it's exactly 100 instructions, completely unrolled (no iterations, just straight-line code).

For double, with -O3 -march=skylake-avx512 -funroll-loops it's unrolled into 3 large iterations, with 47 instructions in the loop body. Everything is vectorized, the pipelines run hot and heavy, and you're getting all the money you paid for that expensive CPU back. Finally.

But we're forcing the compiler's hand - again. It turns out that unrolling the loop that much isn't always critical - it depends on what CPU you run it on, and the benefits are somewhat dwarfed by the expansion of the code. In production you'd want to have this function compiled two different ways, benchmark it at startup, and choose the faster one. Without -funroll-loops, the entire test3 function is 32 instructions long and iterates the loop body 24 times. Each iteration exchanges over 120 bytes of data with RAM (reads and writes combined) in 6 instructions - about 20 bytes per instruction on average (this is really approximate; I've not looked too closely).
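The "compile it two ways and benchmark at startup" idea can be sketched as follows. Everything here is a hypothetical illustration: variant_a and variant_b stand in for the two compiled builds, and a real harness would use a higher-resolution clock and more careful methodology than the C standard clock():

```c
#include <time.h>

typedef void (*kernel_fn)(void);

/* Time reps calls of k with clock(). Coarse, but enough to sketch
   the idea of selecting an implementation at startup. */
static double time_kernel(kernel_fn k, int reps) {
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        k();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Return whichever of the two candidates ran faster here and now. */
static kernel_fn pick_faster(kernel_fn a, kernel_fn b, int reps) {
    return time_kernel(a, reps) <= time_kernel(b, reps) ? a : b;
}

/* Hypothetical stand-ins for the two compiled variants; the volatile
   sink keeps the busywork from being optimized away. */
static volatile int sink;
static void variant_a(void) { for (int i = 0; i < 1000; i++) sink += i; }
static void variant_b(void) { for (int i = 0; i < 2000; i++) sink += i; }
```

At program startup you would call pick_faster once, store the result in a function pointer, and route all later calls through it.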

From my cursory measurements it looks like this double variant runs at an order of magnitude (!) higher memory bandwidth than the original int version.

It was a fun adventure for sure.

And by the way - because it has to be said: modern compilers are marvels of engineering. I'm not exaggerating. You really have to head to https://godbolt.org and have a go at it yourself!