I'm trying to learn about vectorization by studying simple C code compiled with gcc at -O3 optimization. More specifically, I want to see how well compilers vectorize. It is a personal journey towards being able to verify gcc -O3 performance on more complex computations. I understand that conventional wisdom is that compilers are better than people, but I never take such wisdom for granted.
In my first simple test, though, I'm finding some of the choices gcc makes quite strange and, quite honestly, grossly negligent in terms of optimization. I'm willing to assume the compiler is being purposeful and knows something about the CPU (Intel i5-2557M in this case) that I do not. But I need some confirmation from knowledgeable people.
My simple test code (segment) is:
int i;
float a[100];
for (i=0;i<100;i++) a[i]= (float) i*i;
The resulting assembly code (segment) that corresponds to the for-loop is as follows:
.L6: ; loop starts here
movdqa xmm0, xmm1 ; copy packed integers in xmm1 to xmm0
.L3:
movdqa xmm1, xmm0 ; wait, what!? WHY!? this is redundant.
cvtdq2ps xmm0, xmm0 ; convert integers to float
add rax, 16 ; increment memory pointer for next iteration
mulps xmm0, xmm0 ; packed square of all floats in xmm0
paddd xmm1, xmm2 ; packed increment of all integers by 4
movaps XMMWORD PTR [rax-16], xmm0 ; store result
cmp rax, rdx ; test loop termination
jne .L6
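To make the data flow explicit, here is a minimal plain-C sketch of how I read the loop. Each 4-element array stands in for one 128-bit xmm register. The function name `squares_lane_model` is my own, and the starting values {0,1,2,3} and the step {4,4,4,4} are assumptions, since the setup code is not shown above.

```c
#include <stddef.h>
#include <stdint.h>

/* Lane-by-lane model of the vectorized loop. Each array of 4 models one
   xmm register. The initial contents of xmm0 and xmm2 are ASSUMED; the
   setup code that loads them is not part of the listing. */
void squares_lane_model(float *a, size_t n)   /* n must be a multiple of 4 */
{
    int32_t xmm0[4] = {0, 1, 2, 3};           /* assumed initial indices */
    int32_t xmm1[4];
    const int32_t xmm2[4] = {4, 4, 4, 4};     /* assumed per-iteration step */
    float f[4];                               /* xmm0 after the conversion */
    int k;

    for (size_t base = 0; base < n; base += 4) {
        for (k = 0; k < 4; k++) xmm1[k] = xmm0[k];      /* .L3: movdqa xmm1, xmm0 */
        for (k = 0; k < 4; k++) f[k] = (float)xmm0[k];  /* cvtdq2ps xmm0, xmm0 */
        for (k = 0; k < 4; k++) f[k] = f[k] * f[k];     /* mulps xmm0, xmm0 */
        for (k = 0; k < 4; k++) xmm1[k] += xmm2[k];     /* paddd xmm1, xmm2 */
        for (k = 0; k < 4; k++) a[base + k] = f[k];     /* movaps [rax-16], xmm0 */
        for (k = 0; k < 4; k++) xmm0[k] = xmm1[k];      /* .L6: movdqa xmm0, xmm1 (back edge) */
    }
}
```

Calling `squares_lane_model(a, 100)` on a `float a[100]` should fill it with the same values as the original for-loop. In this model the two register copies sit at the bottom and top of consecutive iterations, which is exactly the back-to-back pair that puzzles me.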
I understand all the steps, and computationally, all of it makes sense. What I don't understand, though, is gcc choosing to incorporate in the iterative loop a step to load xmm1 with xmm0 right after xmm0 was loaded with xmm1. i.e.
.L6:
movdqa xmm0, xmm1 ; loop starts here
.L3:
movdqa xmm1, xmm0 ; grrr!
This alone makes me question the sanity of the optimizer. Obviously, the extra MOVDQA does not disturb data, but at face value, it would seem grossly negligent on the part of gcc.
Earlier in the assembly code (not shown), xmm0 and xmm2 are initialized to some value meaningful for vectorization, so obviously, at the start of the loop, the code has to skip the first MOVDQA. But why doesn't gcc simply rearrange, as shown below?
.L3:
movdqa xmm1, xmm0 ; initialize xmm1 PRIOR to loop
.L6:
movdqa xmm0, xmm1 ; loop starts here
Or even better, simply initialize xmm1 instead of xmm0 and dump the MOVDQA xmm1, xmm0 step altogether!
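As a sanity check on that idea, here is a hedged plain-C sketch of the rearranged loop, in which the register I call xmm1 is the only state carried across iterations and xmm0 is pure scratch. The function name `squares_one_copy` and the initial values {0,1,2,3} and {4,4,4,4} are my assumptions, not something taken from gcc's output.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the rearranged loop: idx plays the role of xmm1 and is the
   only register that carries state from one iteration to the next.
   Initial values are ASSUMED (the real setup code is not shown). */
void squares_one_copy(float *a, size_t n)     /* n must be a multiple of 4 */
{
    int32_t idx[4] = {0, 1, 2, 3};            /* xmm1, initialized before the loop */
    const int32_t step[4] = {4, 4, 4, 4};     /* xmm2 */

    for (size_t base = 0; base < n; base += 4) {
        float tmp[4];                         /* xmm0, scratch within one iteration */
        for (int k = 0; k < 4; k++) {
            tmp[k] = (float)idx[k];           /* movdqa xmm0, xmm1 ; cvtdq2ps xmm0, xmm0 */
            tmp[k] = tmp[k] * tmp[k];         /* mulps xmm0, xmm0 */
            idx[k] += step[k];                /* paddd xmm1, xmm2 */
            a[base + k] = tmp[k];             /* movaps [rax-16], xmm0 */
        }
    }
}
```

If this model is right, one register copy per iteration is enough to produce the same array contents, which is why the second MOVDQA looks avoidable to me.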
I am prepared to believe that the CPU is smart enough to skip the redundant step or something like that, but how can I trust gcc to fully optimize complex code if it can't even get this simple code right? Or can someone provide a sound explanation that would give me faith that gcc -O3 is good stuff?