7
votes

I'm trying to learn about vectorization by studying simple C code compiled in gcc with -O3 optimization. More specifically, how well compilers vectorize. It is a personal journey towards being able to verify gcc -O3 performance with more complex computation. I understand that conventional wisdom is that compilers are better than people, but I never take such wisdom for granted.

In my first simple test, though, I'm finding some of the choices gcc makes quite strange and, quite honestly, grossly negligent in terms of optimization. I'm willing to assume there is something the compiler is purposeful and knows something about the CPU (Intel i5-2557M in this case) that I do not. But I need some confirmation from knowledgeable people.

My simple test code (segment) is:

int i;
float a[100];

for (i=0;i<100;i++) a[i]= (float) i*i;

The resulting assembly code (segment) that corresponds to the for-loop is as follows:

.L6:                        ; loop starts here
    movdqa  xmm0, xmm1      ; copy packed integers in xmm1 to xmm0
.L3:
    movdqa  xmm1, xmm0      ; wait, what!?  WHY!?  this is redundant.
    cvtdq2ps    xmm0, xmm0  ; convert integers to float
    add rax, 16             ; increment memory pointer for next iteration
    mulps   xmm0, xmm0      ; pack square all integers in xmm0
    paddd   xmm1, xmm2      ; pack increment all integers by 4 
    movaps  XMMWORD PTR [rax-16], xmm0   ; store result 
    cmp rax, rdx            ; test loop termination
    jne .L6                 

I understand all the steps, and computationally, all of it makes sense. What I don't understand, though, is gcc choosing to incorporate in the iterative loop a step to load xmm1 with xmm0 right after xmm0 was loaded with xmm1. i.e.

 .L6
        movdqa  xmm0, xmm1      ; loop starts here
 .L3
        movdqa  xmm1, xmm0      ; grrr! 

This alone makes me question the sanity of the optimizer. Obviously, the extra MOVDQA does not disturb data, but at face-value, it would seems grossly negligent on the part of gcc.

Earlier in the assembly code (not shown), xmm0 and xmm2 are initialized to some value meaningful for vectorization, so obviously, at the onset of the loop, the code has to skip the first MOVDQA. But why doesn't gcc simply rearrange, as shown below.

.L3
        movdqa  xmm1, xmm0     ; initialize xmm1 PRIOR to loop
.L6
        movdqa  xmm0, xmm1     ; loop starts here 

Or even better, simply initialize xmm1 instead of xmm0 and dump the MOVDQA xmm1, xmm0 step altogether!

I am prepared to believe that the CPU is smart enough to skip the redundant step or something like that, but how can I trust gcc to fully optimize complex code, if it can even get this simple code right? Or can someone provide a sound explanation that would give me faith that gcc -O3 is good stuff?

1
@Down voters: please comment why.Stefan
Are you sure that your code is faster than the compilers? Have you tried timing them?Degustaf
@Stefan Not being one of the downvoters, I would guess that it is because this question verges on being a rant against gcc.Degustaf
What version of gcc is being used?Michael Burr
.L3 seems to be a redundant label that might've been leftover from some optimization pass. (possibly removing the entry condition to the for-loop) So I'm not surprised that other stuff got left behind. After all, compiler's aren't perfect.Mysticial

1 Answers

4
votes

I'm not 100% sure, but it looks like your loop destroys xmm0 by converting it to float, so you to have the integer value in xmm1 and then copy over to another register (in this case xmm0).

Whilst compilers are known to sometimes issue unnecessary instructions, I can't really see how this is the case in this instance.

If you want xmm0 (or xmm1) to remain integer, then don't have a cast of float for the first value of i. Perhaps what you wanted to do is:

 for (i=0;i<100;i++) 
    a[i]= (float)(i*i);

But on the other hand, gcc 4.9.2 doesn't seem to do this:

g++ -S -O3 floop.cpp

.L2:
    cvtdq2ps    %xmm1, %xmm0
    mulps   %xmm0, %xmm0
    addq    $16, %rax
    paddd   %xmm2, %xmm1
    movaps  %xmm0, -16(%rax)
    cmpq    %rbp, %rax
    jne .L2

Nor does clang (3.7.0 from about 3 weeks ago)

 clang++ -S -O3 floop.cpp


    movdqa  .LCPI0_0(%rip), %xmm0   # xmm0 = [0,1,2,3]
    xorl    %eax, %eax
    .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movd    %eax, %xmm1
    pshufd  $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
    paddd   %xmm0, %xmm1
    cvtdq2ps    %xmm1, %xmm1
    mulps   %xmm1, %xmm1
    movaps  %xmm1, (%rsp,%rax,4)
    addq    $4, %rax
    cmpq    $100, %rax
    jne .LBB0_1

Code that I have compiled:

extern int printf(const char *, ...);

int main()
{
    int i;
    float a[100];

    for (i=0;i<100;i++)
        a[i]= (float) i*i;

    for (i=0; i < 100; i++)
        printf("%f\n", a[i]);
}

(I added the printf to avoid the compiler getting rid of ALL the code)