
I'm currently doing an assignment that measures the performance of various x86-64 instructions (AT&T syntax).

The instruction I'm somewhat confused about is the unconditional jmp. This is how I've implemented it:

.global uncond
uncond:
    .rept 10000
    jmp . + 2
    .endr

    mov $10000, %rax
    ret

It's fairly simple. The code defines a function called "uncond" which uses the .rept directive to repeat the jmp instruction 10000 times, then sets the return value to the number of jmp instructions executed.

"." in AT&T syntax means the current address, which I increase by 2 bytes to account for the size of the jmp instruction itself (so jmp . + 2 should simply jump to the next instruction).

Code that I haven't shown calculates the number of cycles it takes to process the 10000 instructions.
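The timing harness isn't shown, but a minimal sketch of one (my own guess, not the asker's code; the file names uncond.s and harness.c are made up, and it assumes an x86-64 Linux box with gcc) could read the time-stamp counter around the call:

```shell
# Build a smaller copy of the assembly (1000 jmps to keep it quick)
# plus a C driver that reads the TSC before and after the call.
cat > uncond.s <<'EOF'
.global uncond
uncond:
    .rept 1000
    jmp . + 2
    .endr
    mov $1000, %rax
    ret
EOF

cat > harness.c <<'EOF'
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>      /* __rdtsc() */

long uncond(void);          /* the assembly function above */

int main(void) {
    uint64_t start = __rdtsc();
    long n = uncond();
    uint64_t stop = __rdtsc();
    printf("%.2f cycles per jmp\n", (double)(stop - start) / n);
    return 0;
}
EOF

gcc uncond.s harness.c -o harness
./harness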

My results say jmp is pretty slow (about 10 cycles to process a single jmp instruction) - but from what I understand about pipelining, unconditional jumps should be very fast (no branch mispredictions).

Am I missing something? Is my code wrong?

Possible duplicate of Slow jmp-instruction. That more detailed question has a much better and more detailed answer. – Peter Cordes

1 Answer


The CPU isn't optimized for no-op jmp instructions, so it doesn't handle the special case of continuing to decode and pipeline jmp instructions that just jump to the next insn.

CPUs are optimized for loops, though. jmp . will run at one insn per clock on many CPUs, or one per 2 clocks on some CPUs.


A jump creates a bubble in instruction fetching. A single well-predicted jump is ok, but running nothing but jumps is problematic. I reproduced your results on a core2 E6600 (Merom/Conroe microarch):

# jmp-test.S
.globl _start
_start:

    mov $100000, %ecx
jmp_test:
    .rept 10000
    jmp . + 2
    .endr

    dec %ecx
    jg jmp_test


    mov $231, %eax
    xor %ebx,%ebx
    syscall          #  exit_group(0)

build and run with:

gcc -static -nostartfiles jmp-test.S
perf stat -e task-clock,cycles,instructions,branches,branch-misses ./a.out

 Performance counter stats for './a.out':

       3318.616490      task-clock (msec)         #    0.997 CPUs utilized          
     7,940,389,811      cycles                    #    2.393 GHz                      (49.94%)
     1,012,387,163      instructions              #    0.13  insns per cycle          (74.95%)
     1,001,156,075      branches                  #  301.679 M/sec                    (75.06%)
           151,609      branch-misses             #    0.02% of all branches          (75.08%)

       3.329916991 seconds time elapsed

From another run:

 7,886,461,952      L1-icache-loads           # 2377.687 M/sec                    (74.95%)
     7,715,854      L1-icache-load-misses     #    2.326 M/sec                    (50.08%)
 1,012,038,376      iTLB-loads                #  305.119 M/sec                    (75.06%)
           240      iTLB-load-misses          #    0.00% of all iTLB cache hits   (75.02%)

(Numbers in (%) at the end of each line are how much of the total run time that counter was active for: perf has to multiplex for you when you ask it to count more events than the hardware can count at once.)

So it's not actually I-cache misses, it's just instruction fetch/decode frontend bottlenecks caused by constant jumps.

My SnB machine is broken, so I can't test numbers on it, but 8 cycles per jmp sustained throughput is pretty close to your results (which were probably from a different microarchitecture).

For more details, see http://agner.org/optimize/, and other links from the tag wiki.