First: I know what loop optimization is and how it works yet I found a case where I cannot explain the results.
I created a prime number checker that calls modulo on each number from 2 to n - 1, so no algorithmical optimizations.
EDIT: I know that prime numbers can be calculated more efficiently but this is just an example for loop behaviour.
Then I created a normal and an optimized version:
#include <stdlib.h>
#include <stdio.h>
typedef unsigned long long natural;
int is_prime(natural n){
int is_not_prime = 0;
for(natural i = 2; i < n; i += 1){
is_not_prime |= !!!(n % i);
}
if(is_not_prime){
return 0;
}else{
return 1;
}
}
//__attribute__((noinline))
int is_prime_opt(natural n){
int is_not_prime = 0;
for(natural i = 2; i < n; i += 8){
is_not_prime |= !!(
!(n % i) |
!(n % i + 1) |
!(n % i + 2) |
!(n % i + 3) |
!(n % i + 4) |
!(n % i + 5) |
!(n % i + 6) |
!(n % i + 7));
}
if(is_not_prime){
return 0;
}else{
return 1;
}
}
int main(int argc, char *argv[])
{
if(argc != 2)
return 1;
natural check = atoi(argv[1]);
if(is_prime(check)){
printf("%llu is prime\n", check);
}
return 0;
}
I compiled the code with gcc with -O3
to force all optimizations done by the compiler. Since the count of iterations is not known at compile time I expect the compiler not to unroll the loop.
I created a second version that does the same in blocks of 8 numbers. Since some inputs are not dividable by 8 the loop calculates at worst 7 items to much but this is acceptable.
I checked the cycles with valgrind --tool=callgrind ./prime 100000000
with the following outputs:
unoptimized:
==983== Callgrind, a call-graph generating cache profiler
==983== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==983== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==983== Command: ./prime 100000000
==983==
==983== For interactive control, run 'callgrind_control -h'.
==983==
==983== Events : Ir
==983== Collected : 1000098047
==983==
==983== I refs: 1,000,098,047
optimized:
==2307== Callgrind, a call-graph generating cache profiler
==2307== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==2307== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==2307== Command: ./prime 100000000
==2307==
==2307== For interactive control, run 'callgrind_control -h'.
==2307==
==2307== Events : Ir
==2307== Collected : 137598072
==2307==
==2307== I refs: 137,598,072
I expected the loop to be 10-20% faster since I save 1/8th of jumps and checks. Also the branch prediction should already speed up the first version as all except the last jump go to the same direction.
What is unclear to me why it is faster by over 7 times? Since I called it with 100M I would expect it to do at least 100M - 3 (w/o 0, 1, n) modulo, or and negation operations but it needs only 1.37 cycles per element for this (and afaik modulo isn't a cheap operation).
|
like||
? you could try with||
and see if the optimizer isn't actually doing this behind the scenes? – Jean-François Fabrei < n
instead of!is_not_prime && (i < n)
? – dbush!(n % i + 1)
seems to be odd,n%i
will result in0
or a positive number, adding1
will lead to a positive number, calculating!
on it will result in0
. So every!(n % i + XX)
can be optimized away. – mch