C loop unrolling optimization performance

Question

First: I know what loop optimization is and how it works yet I found a case where I cannot explain the results.

I created a prime number checker that calls modulo on each number from 2 to n - 1, so no algorithmical optimizations.

EDIT: I know that prime numbers can be calculated more efficiently but this is just an example for loop behaviour.

Then I created a normal and an optimized version:

#include <stdlib.h>
#include <stdio.h>

typedef unsigned long long natural;

int is_prime(natural n){
    int is_not_prime = 0;

    for(natural i = 2; i < n; i += 1){
        is_not_prime |= !!!(n % i);
    }

    if(is_not_prime){
        return 0;
    }else{
        return 1;
    }
}

//__attribute__((noinline))
int is_prime_opt(natural n){
    int is_not_prime = 0;

    for(natural i = 2; i < n; i += 8){
        is_not_prime |= !!(
                !(n % i) |
                !(n % i + 1) |
                !(n % i + 2) |
                !(n % i + 3) |
                !(n % i + 4) |
                !(n % i + 5) |
                !(n % i + 6) |
                !(n % i + 7));
    }

    if(is_not_prime){
        return 0;
    }else{
        return 1;
    }
}

int main(int argc, char *argv[])
{
    if(argc != 2)
        return 1;

    natural check = atoi(argv[1]);

    if(is_prime(check)){
        printf("%llu is prime\n", check);
    }

    return 0;
}

I compiled the code with gcc with -O3 to force all optimizations done by the compiler. Since the count of iterations is not known at compile time I expect the compiler not to unroll the loop. I created a second version that does the same in blocks of 8 numbers. Since some inputs are not dividable by 8 the loop calculates at worst 7 items to much but this is acceptable.

I checked the cycles with valgrind --tool=callgrind ./prime 100000000 with the following outputs:

unoptimized:

==983== Callgrind, a call-graph generating cache profiler
==983== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==983== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==983== Command: ./prime 100000000
==983== 
==983== For interactive control, run 'callgrind_control -h'.
==983== 
==983== Events    : Ir
==983== Collected : 1000098047
==983== 
==983== I   refs:      1,000,098,047

optimized:

==2307== Callgrind, a call-graph generating cache profiler
==2307== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==2307== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==2307== Command: ./prime 100000000
==2307== 
==2307== For interactive control, run 'callgrind_control -h'.
==2307== 
==2307== Events    : Ir
==2307== Collected : 137598072
==2307== 
==2307== I   refs:      137,598,072

I expected the loop to be 10-20% faster since I save 1/8th of jumps and checks. Also the branch prediction should already speed up the first version as all except the last jump go to the same direction.

What is unclear to me why it is faster by over 7 times? Since I called it with 100M I would expect it to do at least 100M - 3 (w/o 0, 1, n) modulo, or and negation operations but it needs only 1.37 cycles per element for this (and afaik modulo isn't a cheap operation).

is that intentional not to shortcut the | like || ? you could try with || and see if the optimizer isn't actually doing this behind the scenes? — Jean-François Fabre
Have you looked at the generated code to see what it really does? — Some programmer dude
Any particular reason your condition is i < n instead of !is_not_prime && (i < n) ? — dbush
!(n % i + 1) seems to be odd, n%i will result in 0 or a positive number, adding 1 will lead to a positive number, calculating ! on it will result in 0. So every !(n % i + XX) can be optimized away. — mch
What's your CPU and architecture? x86 is a relatively register-starved CPU, even in 64-bit mode. If I'm reading your question correctly, your one-loop implementation is outperforming your hand-unrolled one. Unrolling loops increases the number of registers needed. I'd bet if you examined the actual instructions emitted by your compiler, the unrolled implementation would have a lot of register spills and fills. — Andrew Henle

mch mch · Accepted Answer · 2018-10-19T12:32:35

!(n % i + 1) seems to be odd, n%i will result in 0 or a positive number, adding 1 will lead to a positive number, calculating ! on it will result in 0. So every !(n % i + XX) can be optimized away.

It should be !(n % (i + 1)).

C loop unrolling optimization performance

2 Answers