
My CPU is Ivy Bridge. Let's consider an example from Agner Fog's Optimizing Assembly manual, section 12.6, example 12.10a:

    movsd xmm2, [x]   ; x
    movsd xmm1, [one] ; x^0 = 1.0
    xorps xmm0, xmm0  ; sum = 0.0
    mov eax, coeff    ; pointer to coefficient array

    L1:
        movsd xmm3, [eax] ; c[i]
        mulsd xmm3, xmm1  ; c[i]*x^i
        mulsd xmm1, xmm2  ; x^(i+1)
        addsd xmm0, xmm3  ; sum += c[i]*x^i
        add   eax, 8
        cmp eax, coeff_end
        jb L1

The frontend is not a bottleneck (that's obvious, given the latency of the multiplications).

We have two loop-carried dependency chains (ignoring add eax, 8):

1. mulsd xmm1, xmm2 — latency 5 cycles
2. addsd xmm0, xmm3 — latency 3 cycles

By the way, I can't decide: should I add the latencies up (5 + 3 = 8 cycles) or take the larger one, i.e. 5 cycles?

I've tested it on an array of 10,000,000 elements. It takes 6.7 cycles per iteration according to perf, and 5.9 cycles per iteration according to Agner's tool.

Please explain why it takes 6.7 cycles instead of just 5.

You can't ignore the cost of loading the data from RAM. – Hans Passant
Can you try to write more useful question titles? "Loop too slow" doesn't tell anyone anything except that you should have used the [optimization] or [performance] tags. Some of your previous questions have had really terrible titles, too, like "Too much cycles". – Peter Cordes
Looks like the longest loop-carried dep chain is the mulsd xmm1, xmm2. Next iteration's mul can start before this iteration's addsd finishes, because it doesn't depend on xmm0. The other instructions depend on that loop-carried mul chain, but they fork off a separate dep chain for each iteration (except for the addsd, which is also loop-carried, but lower latency). – Peter Cordes
Thanks :). I improved my post :) – Gilgamesz

1 Answer


Probably resource conflicts are sometimes delaying the loop-carried mulsd. Uop scheduling chooses oldest-ready-first (out of uops that have their inputs ready), not critical-path-first, so mulsd xmm3, xmm1 probably steals the execution port occasionally and delays the loop-carried mulsd xmm1, xmm2 by a cycle.

There are probably also occasional pipeline stalls from cache misses, since hardware prefetching doesn't work across page boundaries. Test for this by putting an outer repeat-loop around an inner loop over ~128 kiB of data (half the 256 kiB L2 cache).

Prefetch should normally have no problem keeping up with your loop, though: one 64-bit load per 5 cycles (8 B every 5 cycles at ~3 GHz is roughly 5 GB/s) is only about 1/5th of main memory bandwidth.

You could also use float instead of double.


Does your data have any denormals? Those are slow (but NaN isn't, with SSE).