10
votes

As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. by Agner Fog), but I'm doing it for fun and self-education.

My test code is pretty simple (shown here as pseudo code for simplicity; in reality it is assembler):

for(outer_loop=0; outer_loop<NO;outer_loop++){
    operation  #first
    operation  #second
    ...
    operation #NI-th
} 
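
A minimal sketch of such a harness in GAS syntax (placeholder values for NO and NI, with the repeated copies emitted by .rept; not the exact code from the scripts linked below):

        .globl  main
main:
        movl    $100000, %ecx        # NO outer iterations (placeholder value)
.Louter:
        .rept   1000                 # NI copies of the operation under test
        xorl    %eax, %eax
        .endr
        subl    $1, %ecx
        jnz     .Louter
        xorl    %eax, %eax           # return 0 from main
        ret

It can be assembled and linked with gcc harness.s -o harness (the file name is arbitrary).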

Yet some things should be considered:

  1. If the inner part of the loop is large (NI > 10^7), the whole content of the loop does not fit into the instruction cache and thus must be loaded over and over again, so the speed of RAM determines the execution time. For example, for large inner parts, xorl %eax, %eax (2 bytes) is 33% faster than xorq %rax, %rax (3 bytes); see the encoding check after this list.
  2. If NI is small and the whole loop fits easily into the instruction cache, then xorl %eax, %eax and xorq %rax, %rax are equally fast and can be executed 4 times per clock cycle.
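
The byte counts can be checked by assembling a small test file (a hypothetical enc.s) and disassembling it; the bytes in the comments are what the disassembly reports:

        # as enc.s -o enc.o && objdump -d enc.o
        xorl    %eax, %eax      # 31 c0     -> 2 bytes
        xorq    %rax, %rax      # 48 31 c0  -> 3 bytes (REX.W prefix + same opcode)
        jmp     .Lnext          # eb 00     -> 2-byte short jump (used in the jmp test below)
.Lnext: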

However, this simple model does not hold water for the jmp instruction. For the jmp instruction, my test code looks as follows:

for(outer_loop=0; outer_loop<NO; outer_loop++){
    jmp .L0
    .L0: jmp .L1
    .L1: jmp .L2
    ...
}
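
Such a chain of jumps can be emitted without writing out every label, e.g. with GAS's .rept directive and a numeric local label (a sketch only, not necessarily how the linked script generates it):

        .rept   1000            # NI jumps, each to the very next instruction
        jmp     1f              # assembles to eb 00 (short jump, displacement 0)
1:
        .endr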

And the results are:

  1. For "large" loop sizes (already for NI > 10^4) I measure 4.2 ns per jmp instruction (which would equate to 42 bytes loaded from RAM, or about 12 clock cycles on my machine).
  2. For small loop sizes (NI < 10^3) I measure 1 ns per jmp instruction (around 3 clock cycles, which sounds plausible; Agner Fog's tables show a cost of 2 clock cycles).

The instruction jmp .LX uses the 2-byte eb 00 encoding (a short jump with zero displacement).

Thus, my question: what could be the explanation for the high cost of the jmp instruction in the "large" loops?

PS: If you would like to try it out on your machine, you can download the scripts from here; just run sh jmp_test.sh in the src folder.


Edit: Experimental results confirming Peter's BTB size theory.

The following table shows cycles per instruction for different NI values (relative to NI=1000):

|operations \ NI      | 1000 |  2000|  3000|  4000|  5000| 10000|
|---------------------|------|------|------|------|------|------|
|jmp                  |  1.0 |  1.0 |  1.0 |  1.2 |  1.9 |   3.8|
|jmp+xor              |  1.0 |  1.2 |  1.3 |  1.6 |  2.8 |   5.3|
|jmp+cmp+je (jump)    |  1.0 |  1.5 |  4.0 |  4.4 |  5.5 |   5.5|
|jmp+cmp+je (no jump) |  1.0 |  1.2 |  1.3 |  1.5 |  3.8 |   7.6|

It can be seen:

  1. For the jmp instruction, a (yet unknown) resource becomes scarce, and this leads to performance degradation for NI larger than about 4000.
  2. This resource is not shared with instructions such as xor: the degradation still kicks in at NI of about 4000 if jmp and xor are executed after each other.
  3. But this resource is shared with je if the jump is taken: for jmp+je after each other, the resource becomes scarce already at NI of about 2000 (presumably because each block then contains two taken branches, so about 4000 entries are needed in total).
  4. However, if je does not jump at all, the resource once again becomes scarce only at NI of about 4000 (4th line of the table).

Matt Godbolt's branch-prediction reverse-engineering article establishes that the branch target buffer capacity is 4096 entries. That is very strong evidence that BTB misses are the reason for the observed throughput difference between small and large jmp loops.

The names are in the debug info. Release executables won't have label names anywhere. - doug65536
Note that xorq %rax,%rax does exactly the same thing as xorl %eax,%eax, so there is almost never a reason to use the former (except perhaps to avoid having to insert a nop for alignment somewhere). - fuz
Your "large" 10,000-instruction loops would easily fit into the L2 cache of a modern processor (256K), so you're not measuring the speed of RAM. - Ross Ridge
@RossRidge You are right; for mov and xor I need to go as far as 10^7 instructions in the loop to see the "RAM speed". However, jmp becomes 4 times slower from 10^3 to 10^4. I'm not saying it is because of RAM - it is something different, but I don't quite know what it is. - ead
You probably already understood it (since you wrote that test case in the first place), but it probably bears being explicit - the reason your jmp+cmp+je (no jump) case doesn't hit resource scarcity until around 4,000 jumps is because jumps that aren't taken don't consume a BTB entry (indeed, there would be nothing to put in the BTB!). - BeeOnRope

1 Answer

16
votes

TL:DR: my current guess is running out of BTB (branch target buffer) entries. Pipelined code fetch needs to predict the existence of an unconditional branch before it's even decoded. See below.

2021 update: https://blog.cloudflare.com/branch-predictor/ explores this in detail, using a block of jmp next_insn as an experiment. Branch density, and aliasing (the same offset relative to a 64-byte line), for example, can matter.


Even though your jmps are no-ops, the CPU doesn't have extra transistors to detect this special case. They're handled just like any other jmp, which means having to re-start instruction fetch from a new location, creating a bubble in the pipeline.

To learn more about jumps and their effect on pipelined CPUs, Control Hazards in a classic RISC pipeline should be a good intro to why branches are difficult. Agner Fog's guides explain the practical implications, but I think they assume some of that kind of background knowledge.


Your Intel Broadwell CPU has a uop-cache, which caches decoded instructions (separate from the 32kiB L1 I-cache).

The uop cache size is 32 sets of 8 ways with 6 uops per line, for a total of 1536 uops (if every line is packed with 6 uops: perfect efficiency). 1536 uops is between your 1000 and 10000 test sizes. Before your edit, I predicted that the cutoff from fast to slow would be right around 1536 total instructions in your loop. It doesn't slow down at all until well beyond 1536 instructions, so I think we can rule out uop-cache effects. This isn't as simple a question as I thought. :)

Running from the uop-cache (small code size) instead of the x86 instruction decoders (large loops) means that there are fewer pipeline stages before the stage that recognizes jmp instructions. So we might expect the bubbles from a constant stream of jumps to be smaller, even though they're predicted correctly.

Running from the decoders is supposed to give a larger branch mispredict penalty (like maybe 20 cycles instead of 15), but these aren't mispredicted branches.


Even though the CPU doesn't need to predict whether the branch is taken or not, it still uses branch-prediction resources to predict that a block of code contains a taken branch before it's decoded.

Caching the fact that there is a branch in a certain block of code, and its target address, allows the frontend to start fetching code from the branch target before the jmp rel32 encoding is actually decoded. Remember that decoding variable-length x86 instructions is hard: you don't know where one instruction starts until the previous one is decoded. So you can't just pattern-match the instruction stream looking for unconditional jumps / calls as soon as it's fetched.

My current theory is that you're slowing down when you run out of branch-target-buffer entries.

See also What branch misprediction does the Branch Target Buffer detect? which has a nice answer, and discussion in this Realworldtech thread.

One very important point: the BTB predicts in terms of which block to fetch next, rather than the exact destination of a specific branch within a fetch block. So instead of having to predict targets for all branches in a fetch block, the CPU just needs to predict the address of the next fetch.


Yes, memory bandwidth can be a bottleneck when running very high throughput stuff like xor-zeroing, but you're hitting a different bottleneck with jmp. The CPU would have time to fetch 42B from memory, but that's not what it's doing. Prefetch can easily keep up with 2 bytes per 3 clocks, so there should be near-zero L1 I-cache misses.

In your xor with/without REX test, main memory bandwidth might actually have been the bottleneck if you tested with a loop large enough not to fit in L3 cache. Consuming 4 × 2 B per cycle on a ~3 GHz CPU is about 24 GB/s, which just about maxes out the 25 GB/s of DDR3-1600. Even the L3 cache would be fast enough to keep up with 4 × 3 B per cycle, though.

It's interesting that main-memory bandwidth is the bottleneck; I initially guessed that decode (in blocks of 16 bytes) would be the bottleneck for 3-byte XORs, but I guess they're small enough.


Also note that it's a lot more normal to measure times in core clock cycles. However, your measurements in ns are useful when you're looking at memory, I guess, because low clock speeds for power saving change the ratio of core clock speed to memory speed. (i.e. memory bottlenecks are less of a problem at minimum CPU clock speed.)

For benchmarking in clock cycles, use perf stat ./a.out. There are other useful performance counters that are essential for understanding the performance characteristics.
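
For example, a typical invocation using generic events (exact event names vary by kernel and CPU):

    perf stat -e cycles,instructions,branches,branch-misses ./a.out

Dividing the cycles count by the number of executed jmp instructions gives cycles per jmp directly, without having to convert from wall-clock time.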

See x86-64 Relative jmp performance for perf-counter results from Core2 (8 cycles per jmp), and some unknown microarchitecture where it's ~10c per jmp.


The details of modern CPU performance characteristics are hard enough to understand even under more or less white-box conditions (reading Intel's optimization manual, and what they've published regarding CPU internals). You're going to get stuck early and often if you insist on black-box testing where you don't read stuff like arstechnica articles about the new CPU design, or maybe some more detailed stuff like David Kanter's Haswell microarch overview, or the similar Sandybridge writeup I linked earlier.

If getting stuck early and often is ok and you're having fun, then by all means keep doing what you're doing. But it makes it harder for people to answer your questions if you don't know those details, like in this case. :/ e.g. my first version of this answer assumed you had read enough to know what the uop cache was.