Is there a way to instruct GCC (I'm using 4.8.4) to completely unroll the while
loop in the function at the bottom, i.e., to peel this loop? The number of iterations of the loop is known at compile time: 58.
Let me first explain what I have tried.
By checking the GAS output of
gcc -fpic -O2 -S GEPDOT.c
I can see that 12 registers, XMM0 - XMM11, are used. If I pass the flag -funroll-loops to gcc:
gcc -fpic -O2 -funroll-loops -S GEPDOT.c
the loop is only unrolled twice. I checked the GCC optimization options. GCC says that -funroll-loops
will turn on -frename-registers
as well, so when GCC unrolls a loop, its first choice for register allocation is to use "left over" registers. But there are only 4 left over, XMM12 - XMM15, so at best GCC can unroll twice. Had there been 48 instead of 16 XMM registers available, GCC would have unrolled the while loop 4 times without trouble.
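One way to test this theory would be to disable the renaming pass on its own (-fno-rename-registers is the documented negation of -frename-registers; I have not verified whether it actually changes the unroll factor here):
gcc -fpic -O2 -funroll-loops -fno-rename-registers -S GEPDOT.c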
Yet I did another experiment. I first unrolled the while loop twice manually, obtaining a function GEPDOT_2
. Then there is no difference at all between
gcc -fpic -O2 -S GEPDOT_2.c
and
gcc -fpic -O2 -funroll-loops -S GEPDOT_2.c
Since GEPDOT_2
already uses up all the registers, no unrolling is performed.
GCC does register renaming to avoid the potential false dependencies that unrolling introduces. But I know for sure that there is no such potential in my GEPDOT
; even if there were, it would not be important. I tried unrolling the loop myself: unrolling 4 times is faster than unrolling 2 times, which in turn is faster than no unrolling. Of course I can manually unroll more times, but that is tedious. Can GCC do this for me? Thanks.
// C file "GEPDOT.c"
#include <emmintrin.h>
void GEPDOT (double *A, double *B, double *C) {
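/* Note: the arithmetic operators (*, +=, -) applied to __m128d values
   below are a GCC vector extension; strictly portable code would use
   _mm_mul_pd, _mm_add_pd and _mm_sub_pd instead. */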
__m128d A1_vec = _mm_load_pd(A); A += 2;
__m128d B_vec = _mm_load1_pd(B); B++;
__m128d C1_vec = A1_vec * B_vec;
__m128d A2_vec = _mm_load_pd(A); A += 2;
__m128d C2_vec = A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
__m128d C3_vec = A1_vec * B_vec;
__m128d C4_vec = A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
__m128d C5_vec = A1_vec * B_vec;
__m128d C6_vec = A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
__m128d C7_vec = A1_vec * B_vec;
A1_vec = _mm_load_pd(A); A += 2;
__m128d C8_vec = A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
int k = 58;
/* can compiler unroll the loop completely (i.e., peel this loop)? */
while (k--) {
C1_vec += A1_vec * B_vec;
A2_vec = _mm_load_pd(A); A += 2;
C2_vec += A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
C3_vec += A1_vec * B_vec;
C4_vec += A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
C5_vec += A1_vec * B_vec;
C6_vec += A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
C7_vec += A1_vec * B_vec;
A1_vec = _mm_load_pd(A); A += 2;
C8_vec += A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
}
C1_vec += A1_vec * B_vec;
A2_vec = _mm_load_pd(A);
C2_vec += A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
C3_vec += A1_vec * B_vec;
C4_vec += A2_vec * B_vec;
B_vec = _mm_load1_pd(B); B++;
C5_vec += A1_vec * B_vec;
C6_vec += A2_vec * B_vec;
B_vec = _mm_load1_pd(B);
C7_vec += A1_vec * B_vec;
C8_vec += A2_vec * B_vec;
/* [write-back] */
A1_vec = _mm_load_pd(C); C1_vec = A1_vec - C1_vec;
A2_vec = _mm_load_pd(C + 2); C2_vec = A2_vec - C2_vec;
A1_vec = _mm_load_pd(C + 4); C3_vec = A1_vec - C3_vec;
A2_vec = _mm_load_pd(C + 6); C4_vec = A2_vec - C4_vec;
A1_vec = _mm_load_pd(C + 8); C5_vec = A1_vec - C5_vec;
A2_vec = _mm_load_pd(C + 10); C6_vec = A2_vec - C6_vec;
A1_vec = _mm_load_pd(C + 12); C7_vec = A1_vec - C7_vec;
A2_vec = _mm_load_pd(C + 14); C8_vec = A2_vec - C8_vec;
_mm_store_pd(C,C1_vec); _mm_store_pd(C + 2,C2_vec);
_mm_store_pd(C + 4,C3_vec); _mm_store_pd(C + 6,C4_vec);
_mm_store_pd(C + 8,C5_vec); _mm_store_pd(C + 10,C6_vec);
_mm_store_pd(C + 12,C7_vec); _mm_store_pd(C + 14,C8_vec);
}
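For reference, here is a minimal test driver (hypothetical, not part of GEPDOT.c; the sizes follow from counting the loads above: 120 _mm_load_pd from A = 240 doubles, 240 broadcast loads from B, and a 4x4 block of 16 doubles in C, i.e. a rank-60 update):
// Hypothetical driver for GEPDOT.  _mm_load_pd requires 16-byte
// alignment, hence the aligned arrays.
#include <stdio.h>

void GEPDOT (double *A, double *B, double *C);

int main (void) {
    static double A[240] __attribute__((aligned(16)));
    static double B[240] __attribute__((aligned(16)));
    static double C[16]  __attribute__((aligned(16)));
    int i;
    for (i = 0; i < 240; i++) { A[i] = 1.0; B[i] = 1.0; }
    for (i = 0; i < 16; i++)  C[i] = 0.0;
    GEPDOT(A, B, C);        /* computes C -= A * B */
    printf("%f\n", C[0]);   /* expect -60.0 with all-ones inputs */
    return 0;
}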
Update 1
Thanks to the comment by @user3386109, I would like to extend this question a little. @user3386109 raises a very good point: I actually do have some doubt about the compiler's ability to allocate registers optimally when there are so many parallel instructions to schedule.
I personally think a reliable way is to first code the loop body (which is the key to HPC) in inline assembly, then duplicate it as many times as I want. I had an unpopular post earlier this year: inline assembly. The code there was a little different, because the number of loop iterations, j, is a function argument and hence unknown at compile time. In that case I could not fully unroll the loop, so I only duplicated the assembly code twice and converted the loop into a label and jump (sketched below). It turned out that my hand-written assembly performed about 5% better than the compiler-generated assembly, which suggests that the compiler fails to allocate registers in the expected, optimal manner.
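To show the structure I mean, here is a bare skeleton of that label-and-jump pattern (a sketch only: the real SSE body is elided, and the function name and the assumption that j is even and at least 2 are mine):
/* Sketch: "duplicate the body twice + label and jump".  Only the
   control flow is shown; the duplicated SSE body is elided. */
static void loop_skeleton (long j) {   /* assumes j >= 2 and even */
    __asm__ __volatile__ (
        "1:                \n\t"
        /* ... loop body, written out twice, goes here ... */
        "subq $2, %[k]     \n\t"       /* two copies consumed per pass */
        "jnz  1b           \n\t"       /* loop back until k reaches 0  */
        : [k] "+r" (j)
        :
        : "cc", "memory"
    );
}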
I was (and still am) a baby in assembly coding, so this served as a good case study for learning a little x86 assembly. But in the long run I am not inclined to code GEPDOT
with a big proportion of assembly. There are mainly three reasons:
- Inline assembly has been criticized for not being portable, though I don't understand why. Perhaps because different machines have different registers clobbered?
- Compilers are also getting better, so I would still prefer algorithmic optimization and good C coding habits to help the compiler generate good output;
- The last reason is more important. The number of iterations may not always be 58. I am developing a high-performance matrix factorization subroutine. For a cache blocking factor nb
, the number of iterations would be nb - 2
. I am not going to pass nb
in as a function argument, as I did in the earlier post. It is a machine-specific parameter and will be defined as a macro. So the number of iterations is known at compile time, but may vary from machine to machine. Guess how much tedious work I would have to do with manual loop unrolling for a variety of nb
values. So if there is a way to simply instruct the compiler to peel a loop, that would be great (one possible workaround is sketched below).
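For illustration, here is one way the full unrolling could be generated at compile time from such a macro, using Boost.Preprocessor (a sketch under my own assumptions: the header-only Boost.Preprocessor library is acceptable, and NB is my hypothetical name for the blocking-factor macro):
/* Drop-in replacement for "int k = NB - 2; while (k--) { ... }"
   inside GEPDOT: the body is pasted NB - 2 times at preprocessing
   time, so no run-time loop remains. */
#include <boost/preprocessor/repetition/repeat.hpp>
#include <boost/preprocessor/arithmetic/sub.hpp>

#define NB 60   /* machine-specific blocking factor (assumed value) */

#define GEPDOT_BODY(z, n, unused)        \
    C1_vec += A1_vec * B_vec;            \
    A2_vec = _mm_load_pd(A); A += 2;     \
    C2_vec += A2_vec * B_vec;            \
    B_vec = _mm_load1_pd(B); B++;        \
    C3_vec += A1_vec * B_vec;            \
    C4_vec += A2_vec * B_vec;            \
    B_vec = _mm_load1_pd(B); B++;        \
    C5_vec += A1_vec * B_vec;            \
    C6_vec += A2_vec * B_vec;            \
    B_vec = _mm_load1_pd(B); B++;        \
    C7_vec += A1_vec * B_vec;            \
    A1_vec = _mm_load_pd(A); A += 2;     \
    C8_vec += A2_vec * B_vec;            \
    B_vec = _mm_load1_pd(B); B++;

/* expands GEPDOT_BODY exactly NB - 2 times */
BOOST_PP_REPEAT(BOOST_PP_SUB(NB, 2), GEPDOT_BODY, unused)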
I would really appreciate it if you could also share some experience in producing high-performance, yet portable, libraries.
Comments:
- -funroll-all-loops? – Nate Eldredge
- Replace the while with preproc__repeat(58). Then write a preprocessor that searches for preproc__repeat, extracts the number, and duplicates the body the indicated number of times. – user3386109