
This is a follow-up to some comments made in this prior thread:

Recursive fibonacci Assembly

The following code snippets calculate Fibonacci numbers, the first example with a loop, the second example with a computed jump (indexed branch) into an unrolled loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro 64-bit with an Intel 3770K 3.5 GHz processor. With a single loop testing fib(0) through fib(93), the best time I get for the loop version is ~1.901 microseconds, and for the computed jump ~1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds and the computed jump about 1.04 seconds. In both sets of tests, the loop version is about 40% slower than the computed jump version.

Question: Why is the loop version much more sensitive to code location than the computed jump version? In prior tests, some code location combinations caused the loop version time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the computed jump version time.

Partial answer: The computed jump version branches into 94 possible target locations within a 280-byte range, and apparently the branch target buffer (cache) does a good job of optimizing this. For the loop version, using align 16 to put the assembly-based fib() function on a 16-byte boundary solved the loop version's timing issue for most cases, but some changes to main() were still affecting the time. I need to find a reasonably small and repeatable test case.
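To make the address arithmetic concrete (this just restates what the computed jump code below does, under its 3-byte-add assumption): the jump target is fib0 + 279 - 3*n, so each value of n skips into the chain of adds at a different depth:

        ; target = fib0 + 279 - 3*n     (each add reg,reg is 3 bytes)
        ; n = 93 -> fib0        : all 93 adds execute
        ; n = 1  -> fib0 + 276  : one add executes
        ; n = 0  -> fib0 + 279  : lands directly on the ret, 0 adds execute
        ; 94 distinct targets spanning fib0 .. fib0+279, a 280 byte range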

loop version (note: I've read that dec / jnz is faster than loop):

        align   16
fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp

computed jump (indexed branch) into unrolled loop version:

        align   16
fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp

Test code that runs 1,048,576 (0x100000) outer loops, each calculating fib(0) to fib(93) using multiples of 37 mod 93 so the order is not sequential. On my system, the loop version took about 1.44 seconds and the indexed branch version took about 1.04 seconds.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern "C" uint64_t fib(uint64_t);

/* multiples of 37 mod 93 + 93 at end */
static uint64_t a[94] = 
     {0,37,74,18,55,92,36,73,17,54,
     91,35,72,16,53,90,34,71,15,52,
     89,33,70,14,51,88,32,69,13,50,
     87,31,68,12,49,86,30,67,11,48,
     85,29,66,10,47,84,28,65, 9,46,
     83,27,64, 8,45,82,26,63, 7,44,
     81,25,62, 6,43,80,24,61, 5,42,
     79,23,60, 4,41,78,22,59, 3,40,
     77,21,58, 2,39,76,20,57, 1,38,
     75,19,56,93};

/* x used to avoid compiler optimizing out result of fib() */
int main()
{
size_t i, j;
clock_t cbeg, cend;
uint64_t x = 0;
    cbeg = clock();
    for(j = 0; j < 0x100000; j++)
        for(i = 0; i < 94; i++)
            x += fib(a[i]);
    cend = clock();
    printf("%llx\n", x);
    printf("# ticks = %u\n", (uint32_t)(cend-cbeg));
    return 0;
}

The output for x is 0x812a62b1dc000000. The sum of fib(0) through fib(93) in hex is 0x1bb433812a62b1dc0; multiplying by the 0x100000 loop count appends five more hex zeros, giving 0x1bb433812a62b1dc000000. The upper six nibbles are truncated by the 64-bit wraparound, leaving 0x812a62b1dc000000.

I made an all-assembly version to better control code location. The "if 1" is changed to "if 0" for the loop version. The loop version takes about 1.465 to 2.000 seconds depending on the nop padding used to put key locations on even or odd 16-byte boundaries (see comments in the code below). The computed jump version takes about 1.04 seconds, and boundaries make less than 1% difference in timing.

        includelib msvcrtd
        includelib oldnames

        .data
; multiples of 37 mod 93 + 93 at the end
a       dq      0,37,74,18,55,92,36,73,17,54
        dq     91,35,72,16,53,90,34,71,15,52
        dq     89,33,70,14,51,88,32,69,13,50
        dq     87,31,68,12,49,86,30,67,11,48
        dq     85,29,66,10,47,84,28,65, 9,46
        dq     83,27,64, 8,45,82,26,63, 7,44
        dq     81,25,62, 6,43,80,24,61, 5,42
        dq     79,23,60, 4,41,78,22,59, 3,40
        dq     77,21,58, 2,39,76,20,57, 1,38
        dq     75,19,56,93
        .data?
        .code
;       parameters      rcx,rdx,r8,r9
;       not saved       rax,rcx,rdx,r8,r9,r10,r11
;       code starts on 16 byte boundary
main    proc
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbp
        mov     rbp,rsp
        and     rsp,0fffffffffffffff0h
        sub     rsp,64
        mov     r15,offset a
        xor     r14,r14
        mov     r11,0100000h
;       nop padding effect on loop version (with 0 padding in padx below)
;        0 puts main2 on  odd 16 byte boundary  clk = 0131876622h => 1.465 seconds
;        9 puts main1 on  odd 16 byte boundary  clk = 01573FE951h => 1.645 seconds
        rept    0
        nop
        endm
        rdtsc
        mov     r12,rdx
        shl     r12,32
        or      r12,rax
main0:  xor     r10,r10
main1:  mov     rcx,[r10+r15]
        call    fib
main2:  add     r14,rax
        add     r10,8
        cmp     r10,8*94
        jne     main1
        dec     r11
        jnz     main0
        rdtsc
        mov     r13,rdx
        shl     r13,32
        or      r13,rax
        sub     r13,r12
        mov     rdx,r14
        xor     rax,rax
        mov     rsp,rbp
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret
main    endp

        align   16
padx    proc
;       nop padding effect on loop version with 0 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 0131876622h => 1.465 seconds
;       16 puts fib on even 16 byte boundary    clk = 01A13C8CB8h => 2.000 seconds
;       nop padding effect on computed jump version with 9 padding above
;        0 puts fib on  odd 16 byte boundary    clk = 00D979792Dh => 1.042 seconds
;       16 puts fib on even 16 byte boundary    clk = 00DA93E04Dh => 1.048 seconds
        rept    0
        nop
        endm
padx    endp

        if      1       ;0 = loop version, 1 = computed jump version

fib     proc                            ;rcx == n
        mov     r8,rcx                  ;set jmp adr
        mov     r9,offset fib0+279
        lea     r8,[r8+r8*2]
        neg     r8
        add     r8,r9
        mov     rax,rcx                 ;set rax,rdx
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        jmp     r8
fib0:   ; assumes add xxx,xxx takes 3 bytes
        rept    46
        add     rax,rdx
        add     rdx,rax
        endm
        add     rax,rdx
        ret
fib     endp

        else

fib     proc                            ;rcx == n
        mov     rax,rcx                 ;br if < 2
        cmp     rax,2
        jb      fib1
        mov     rdx,1                   ;set rax, rdx
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1
fib0:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib0
fib1:   ret     
fib     endp

        endif
        end
@PeterCordes - I updated my post. Apparently code alignment affected the loop version. So with forced alignment, it's back to 1.43 seconds for loop and 1.03 seconds for indexed branch. – rcgldr
Modern branch prediction approaches, including for indirect branches, such as TAGE and ITAGE, potentially spread out the history for one branch across a huge amount of internal state. In prototypical designs, a branch with enough distinct histories could use a large fraction of the branch predictor for its predictions (limited by the number of ways). That's one of the secret sauces behind modern predictors: the storage per PC isn't limited to some small fraction of the BP state, but can expand if useful. – BeeOnRope
One way you can do it for both conditional and indirect branches is just to take a hash of the path history. This is basically just a hash of the targets of the last N jumps. For conditional branches that's different from, but similarly powerful to, the "1 bit taken/not-taken per branch" approach. It can handle cases where control flow converges (e.g., two branches at different PCs jump to the same location and then there is another branch): it keeps those separate while the T/N approach would consider them the same. On the other hand ... – BeeOnRope
@PeterCordes - similarly, for conditional branches with random input, I have to make a loop have a period of many thousands before the predictor starts failing. A periodic loop of 1000 random conditional branches is predicted very successfully, for example. The history length doesn't need to be close to 1000 bits: it just needs to be long enough to uniquely identify the 1000 positions in the period, which is probably something like lg(1000) + 1 = ~11 bits for a reasonable rate. Loop-exit predictions don't get close to 1000 because the history is low entropy: they are a worst case. – BeeOnRope
FWIW, you observed a miss rate of about 1.8% in the period-94 loop. To get a "spurious aliasing" rate of 1.8% needs a lookup table of about ~5000 elements, which means a minimum history size of just over 12 bits. Even in the aliasing case, you have perhaps a 50% chance of getting the target right, since you'll usually alias with one other branch and predictors implement an "anti ping-pong" algorithm, so the actual table is probably half that. The total table storage is what limits these totally random cases though, and having ~2500 indirect-branch entries seems reasonable on Skylake. – BeeOnRope

1 Answer


This was an answer to the original question, about why the loop takes 1.4x the time of the computed-jump version when the result is totally unused. IDK exactly why accumulating the result with a 1-cycle-latency add loop-carried dependency chain would make so much difference. Interesting things to try: store it to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a clobbered register. HW might possibly optimize that (e.g. discard uops once it's sure the result is dead). Intel says Sandybridge-family can do that for one of the flag-result uops in shl reg,cl.
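A minimal sketch of the "store it" experiment in the all-asm harness from the question, assuming a scratch variable discard dq 0 added to the .data segment (hypothetical name):

main1:  mov     rcx,[r10+r15]
        call    fib
        add     r14,rax
        mov     discard,r14     ;dep chain now ends in a store,
                                ;not just a clobbered register
        add     r10,8
        cmp     r10,8*94
        jne     main1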


Old answer: Why the computed jump is 1.4x faster than the loop with the result unused

You're testing throughput here, not latency. In our earlier discussion, I was mostly focusing on latency. That may have been a mistake; throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does afterwards has a data dependency on the result.

Out-of-order execution hides the latency because the result of one call isn't an input dependency for the arg to the next call. And IvyBridge's out-of-order window is large enough to be useful here: a 168-entry ROB (from issue to retirement), a 54-entry scheduler (from issue to execute), and a 160-entry integer physical register file. See also PRF vs. ROB limits for OOO window size.

OOO execution also hides the cost of the branch mispredict that happens before any Fib work gets done. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel CPUs only roll back to the mispredicted branch, and can keep executing uops from before the branch while the mispredict is being resolved.)

It makes sense that the computed-branch version is good here, because you're mostly bottlenecked on uop throughput, and the mispredict from the loop-exit branch costs about the same as the indirect-branch mispredict on entry to the unrolled version. IvB can macro-fuse the dec/jnz into a single uop for port 5, so the 40% number matches up pretty well. (3 ALU execution units, so spending 1/3 of your ALU execution throughput on loop overhead explains it. Branch-mispredict differences and the limits of OOO execution explain the rest.)
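Counting ALU uops per iteration of the loop version makes the 1/3 figure explicit (the port notes are my reading of SnB/IvB-family, not from the original post):

fib0:   add     rdx,rax         ;1 ALU uop, any of ports 0/1/5
        add     rax,rdx         ;1 ALU uop
        dec     rcx             ;macro-fuses with jnz into
        jnz     fib0            ;  1 ALU uop (port 5, as a fused taken branch)
;3 ALU uops per iteration, only 2 of which are Fibonacci adds,
;so 1/3 of ALU throughput goes to loop overhead.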


I think in most real use-cases, latency might well be relevant. Maybe throughput will still be most important, but anything other than this microbenchmark will make latency matter more, because this one doesn't even use the result at all. Of course, it's normal that there will be previous work in the pipeline that can be worked on while an indirect-branch mispredict is recovered from, but that will delay the result being ready, which might mean stalls later if most of the instructions after fib() returns are dependent on the result. But if they aren't (e.g. a lot of reloads and calculations of addresses for where to put the result), having the front-end start issuing uops from after fib() sooner is a good thing.

I think a good middle ground here would be an unroll by 4 or 8, with a check before the unrolled loop to make sure at least one full unrolled iteration should run (e.g. sub rcx,8 / jb .cleanup), as in the sketch below.
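A rough sketch of that shape (my construction, not from the original discussion: labels fib8/fibx/fib9 are hypothetical, and the 0-7 leftover pairs are finished with a short loop, though the computed-jump dispatch would also work for the remainder):

fib     proc                            ;rcx == n
        mov     rax,rcx                 ;same n < 2 early-out and
        cmp     rax,2                   ;rax/rdx setup as the loop version
        jb      fib1
        mov     rdx,1
        and     rax,rdx
        sub     rdx,rax
        shr     rcx,1                   ;rcx = number of add pairs
        sub     rcx,8
        jb      fibx                    ;fewer than 8 pairs: straight to cleanup
fib8:   rept    8                       ;8 pairs per unrolled iteration
        add     rdx,rax
        add     rax,rdx
        endm
        sub     rcx,8
        jae     fib8                    ;continue while >= 8 pairs remain
fibx:   add     rcx,8                   ;rcx = 0..7 remaining pairs
        jz      fib1
fib9:   add     rdx,rax
        add     rax,rdx
        dec     rcx
        jnz     fib9
fib1:   ret
fib     endp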


Also note that your looping version has a data dependency on n for the initial values. In our earlier discussion, I pointed out that avoiding this would be better for out-of-order execution, because it lets the add chain start working before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-branch mispredict on exiting the loop at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx, 2 went below zero instead of to zero.)
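One reading of that last idea, as a sketch (my construction, with hypothetical labels): start rax/rdx at the constant pair F(0) = 0, 1 so the add chain has no data dependency on n, count n itself down by 2, and fix up odd n with a cmov after the loop. The cmov alone suffices here because the previous Fibonacci number is still live in rdx; a lea would only be needed to synthesize one more sum if it weren't:

fib     proc                            ;rcx == n
        xor     eax,eax                 ;rax = F(0); start values independent
        mov     edx,1                   ;of n, so adds can begin before n is ready
        jrcxz   fib1                    ;n == 0: answer is already in rax
fib0:   add     rdx,rax                 ;after k iterations:
        add     rax,rdx                 ;rax = F(2k), rdx = F(2k-1)
        sub     rcx,2
        ja      fib0                    ;exit with ZF set (even n) or CF set (odd n)
        cmovc   rax,rdx                 ;odd n went below zero: F(n) is in rdx
fib1:   ret
fib     endp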