Here's what I found on Skylake for the same loop. All the code to reproduce my tests on your hardware is on github.
I observe three different performance levels based on alignment, whereas the OP only really saw 2 primary ones. The levels are very distinct and repeatable2:
We see three distinct performance levels here (the pattern repeats starting from offset 32), which we'll call regions 1, 2 and 3, from left to right (region 2 is split into two parts straddling region 3). The fastest region (1) is from offset 0 to 8, the middle (2) region is from 9-18 and 28-31, and the slowest (3) is from 19-27. The difference between each region is close to or exactly 1 cycle/iteration.
Based on the performance counters, the fastest region is very different from the other two:
- All the instructions are delivered from the legacy decoder, not from the DSB1.
- There are exactly 2 decoder <-> microcode switches (idq_ms_switches) for every iteration of the loop.
On the hand, the two slower regions are fairly similar:
- All the instructions are delivered from the DSB (uop cache), and not from the legacy decoder.
- There are exactly 3 decoder <-> microcode switches per iteration of the loop.
The transition from the fastest to the middle region, as the offset changes from 8 to 9, corresponds exactly to when the loop starts fitting in the uop buffer, because of alignment issues. You count this out in exactly the same way as Peter did in his answer:
Offset 8:
LSD? <_start.L37>:
ab 1 4000a8: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ac: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b1: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b5: 72 21 jb 4000d8 <_start.L36>
ab 2 4000b7: 31 d2 xor edx,edx
ab 2 4000b9: 48 89 d8 mov rax,rbx
ab 3 4000bc: 48 f7 f1 div rcx
!!!! 4000bf: 48 85 d2 test rdx,rdx
4000c2: 74 0d je 4000d1 <_start.L30>
4000c4: 48 83 c1 01 add rcx,0x1
4000c8: 79 de jns 4000a8 <_start.L37>
In the first column I've annotated how the uops for each instruction end up in the uop cache. "ab 1" means they go in the set associated with address like ...???a?
or ...???b?
(each set covers 32 bytes, aka 0x20
), while 1 means way 1 (out of a max of 3).
At the point !!! this busts out of the uop cache because the test
instruction has no where to go, all the 3 ways are used up.
Let's look at offset 9 on the other hand:
00000000004000a9 <_start.L37>:
ab 1 4000a9: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ad: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b2: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b6: 72 21 jb 4000d9 <_start.L36>
ab 2 4000b8: 31 d2 xor edx,edx
ab 2 4000ba: 48 89 d8 mov rax,rbx
ab 3 4000bd: 48 f7 f1 div rcx
cd 1 4000c0: 48 85 d2 test rdx,rdx
cd 1 4000c3: 74 0d je 4000d2 <_start.L30>
cd 1 4000c5: 48 83 c1 01 add rcx,0x1
cd 1 4000c9: 79 de jns 4000a9 <_start.L37>
Now there is no problem! The test
instruction has slipped into the next 32B line (the cd
line), so everything fits in the uop cache.
So that explains why stuff changes between the MITE and DSB at that point. It doesn't, however, explain why the MITE path is faster. I tried some simpler tests with div
in a loop, and you can reproduce this with simpler loops without any of the floating point stuff. It's weird and sensitive to random other stuff you put in the loop.
For example this loop also executes faster out of the legacy decoder than the DSB:
ALIGN 32
<add some nops here to swtich between DSB and MITE>
.top:
add r8, r9
xor eax, eax
div rbx
xor edx, edx
times 5 add eax, eax
dec rcx
jnz .top
In that loop, adding the pointless add r8, r9
instruction, which doesn't really interact with the rest of the loop, sped things up for the MITE version (but not the DSB version).
So I think the difference between region 1 an region 2 and 3 is due to the former executing out of the legacy decoder (which, oddly, makes it faster).
Let's also take a look at the offset 18 to offset 19 transition (where region2 ends and 3 starts):
Offset 18:
00000000004000b2 <_start.L37>:
ab 1 4000b2: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b6: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bb: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000bf: 72 21 jb 4000e2 <_start.L36>
cd 1 4000c1: 31 d2 xor edx,edx
cd 1 4000c3: 48 89 d8 mov rax,rbx
cd 2 4000c6: 48 f7 f1 div rcx
cd 3 4000c9: 48 85 d2 test rdx,rdx
cd 3 4000cc: 74 0d je 4000db <_start.L30>
cd 3 4000ce: 48 83 c1 01 add rcx,0x1
cd 3 4000d2: 79 de jns 4000b2 <_start.L37>
Offset 19:
00000000004000b3 <_start.L37>:
ab 1 4000b3: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b7: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bc: 66 0f 2e f0 ucomisd xmm6,xmm0
cd 1 4000c0: 72 21 jb 4000e3 <_start.L36>
cd 1 4000c2: 31 d2 xor edx,edx
cd 1 4000c4: 48 89 d8 mov rax,rbx
cd 2 4000c7: 48 f7 f1 div rcx
cd 3 4000ca: 48 85 d2 test rdx,rdx
cd 3 4000cd: 74 0d je 4000dc <_start.L30>
cd 3 4000cf: 48 83 c1 01 add rcx,0x1
cd 3 4000d3: 79 de jns 4000b3 <_start.L37>
The only difference I see here is that the first 4 instructions in the offset 18 case fit into the ab
cache line, but only 3 in the offset 19 case. If we hypothesize that the DSB can only deliver uops to the IDQ from one cache set, this means that at some point one uop may be issued and executed a cycle earlier in the offset 18 scenario than in the 19 scenario (imagine, for example, that the IDQ is empty). Depending on exactly what port that uop goes to in the context of the surrounding uop flow, that may delay the loop by one cycle. Indeed, the difference between region 2 and 3 is ~1 cycle (within the margin of error).
So I think we can say that the difference between 2 and 3 is likely due to uop cache alignment - region 2 has a slightly better alignment than 3, in terms of issuing one additional uop one cycle earlier.
Some addition notes on things I checked that didn't pan out as being a possible cause of the slowdowns:
Despite the DSB modes (regions 2 and 3) having 3 microcode switches versus the 2 of the MITE path (region 1), that doesn't seem to directly cause the slowdown. In particular, simpler loops with div
execute in identical cycle counts, but still show 3 and 2 switches for DSB and MITE paths respectively. So that's normal and doesn't directly imply the slowdown.
Both paths execute essentially identical number of uops and, in particular, have identical number of uops generated by the microcode sequencer. So it's not like there is more overall work being done in the different regions.
There wasn't really an difference in cache misses (very low, as expected) at various levels, branch mispredictions (essentially zero3), or any other types of penalties or unusual conditions I checked.
What did bear fruit is looking at the pattern of execution unit usage across the various regions. Here's a look at the distribution of uops executed per cycle and some stall metrics:
+----------------------------+----------+----------+----------+
| | Region 1 | Region 2 | Region 3 |
+----------------------------+----------+----------+----------+
| cycles: | 7.7e8 | 8.0e8 | 8.3e8 |
| uops_executed_stall_cycles | 18% | 24% | 23% |
| exe_activity_1_ports_util | 31% | 22% | 27% |
| exe_activity_2_ports_util | 29% | 31% | 28% |
| exe_activity_3_ports_util | 12% | 19% | 19% |
| exe_activity_4_ports_util | 10% | 4% | 3% |
+----------------------------+----------+----------+----------+
I sampled a few different offset values and the results were consistent within each region, yet between the regions you have quite different results. In particular, in region 1, you have fewer stall cycles (cycles where no uop is executed). You also have significant variation in the non-stall cycles, although no clear "better" or "worse" trend is evident. For example, region 1 has many more cycles (10% vs 3% or 4%) with 4 uops executed, but the other regions largely make up for it with more cycles with 3 uops executed, and few cycles with 1 uop executed.
The difference in UPC4 that the execution distribution above implies fully explains the difference in performance (this is probably a tautology since we already confirmed the uop count is the same between them).
Let's see what toplev.py has to say about it ... (results omitted).
Well, toplev suggests that the primary bottleneck is the front-end (50+%). I don't think you can trust this because the way it calculates FE-bound seems broken in the case of long strings of micro-coded instructions. FE-bound is based on frontend_retired.latency_ge_8
, which is defined as:
Retired instructions that are fetched after an interval where the
front-end delivered no uops for a period of 8 cycles which was not
interrupted by a back-end stall. (Supports PEBS)
Normally that makes sense. You are counting instructions which were delayed because the frontend wasn't delivering cycles. The "not interrupted by a back-end stall" condition ensures that this doesn't trigger when the front-end isn't delivering uops simply because is the backend is not able to accept them (e.g,. when the RS is full because the backend is performing some low-throuput instructions).
It kind of seems for div
instructions - even a simple loop with pretty much just one div
shows:
FE Frontend_Bound: 57.59 % [100.00%]
BAD Bad_Speculation: 0.01 %below [100.00%]
BE Backend_Bound: 0.11 %below [100.00%]
RET Retiring: 42.28 %below [100.00%]
That is, the only bottleneck is the front-end ("retiring" is not a bottleneck, it represents the useful work). Clearly, such a loop is trivially handled by the front-end and is instead limited by the backend's ability to chew threw all the uops generated by the div
operation. Toplev might get this really wrong because (1) it may be that the uops delivered by the microcode sequencer aren't counted in the frontend_retired.latency...
counters, so that every div
operation causes that event to count all the subsequent instructions (even though the CPU was busy during that period - there was no real stall), or (2) the microcode sequencer might deliver all its ups essentially "up front", slamming ~36 uops into the IDQ, at which point it doesn't deliver any more until the div
is finished, or something like that.
Still, we can look at the lower levels of toplev
for hints:
The main difference toplev calls out between the regions 1 and region 2 and 3 is the increased penalty of ms_switches
for the latter two regions (since they incur 3 every iteration vs 2 for the legacy path. Internally, toplev
estimates a 2-cycle penalty in the frontend for such switches. Of course, whether these penalties actually slow anything down depends in a complex way on the instruction queue and other factors. As mentioned above, a simple loop with div
doesn't show any difference between the DSB and MITE paths, a loop with additional instructions does. So it could be that the extra switch bubble is absorbed in simpler loops (where the backend processing of all the uops generated by the div
is the main factor), but once you add some other work in the loop, the switches become a factor at least for the transition period between the div
and non-div` work.
So I guess my conclusion is that the way the div instruction interacts with the rest of the frontend uop flow, and backend execution, isn't completely well understood. We know it involves a flood of uops, delivered both from the MITE/DSB (seems like 4 uops per div
) and from the microcode sequencer (seems like ~32 uops per div
, although it changes with different input values to the div
op) - but we don't know what those uops are (we can see their port distribution though). All that makes the behavior fairly opaque, but I think it is probably down to either the MS switches bottlnecking the front-end, or slight differences in the uop delivery flow resulting in different scheduling decisions which end up making the MITE order master.
1 Of course, most of the uops are not delivered from the legacy decoder or DSB at all, but by the microcode sequencer (ms). So we loosely talk about instructions delivered, not uops.
2 Note that the x axis here is "offset bytes from 32B alignment". That is, 0 means the top of the loop (label .L37) is aligned to a 32B boundary, and 5 means the loop starts five bytes below a 32B boundary (using nop for padding) and so on. So my padding bytes and offset are the same. The OP used a different meaning for offset, if I understand it correctly: his 1 byte of padding resulted in a 0 offset. So you would subtract 1 from the OPs padding values to get my offset values.
3 In fact, the branch prediction rate for a typical test with prime=1000000000000037
was ~99.999997%, reflecting only 3 mispredicted branches in the entire run (likely on the first pass through the loop, and the last iteration).
4 UPC, i.e., uops per cycle - a measure closely related to IPC for similar programs, and one that is a bit more precise when we are looking in detail at uop flows. In this case, we already know the uop counts are the same for all variations of alignment, so UPC and IPC will be directly proportional.
dpps
for 100K iterations and my counters give 700K uops, of which:idq.dsb_uops 499966284
andidq.ms_dsb_uops 200000595
. – Iwillnotexist Idonotexistloop: div rcx; dec rcx; jne loop
and iterated 100M times dividing zero by a counter. The damage is 3.7B uops, of which 3.2B were fed into DSB by the microcode sequencer and 0.5B came direct from DSB. 0 came from LSD. – Iwillnotexist Idonotexistdec+jne
fused for 100M uops in DSB, the first 4 uops of division also exist in DSB, but the 32 remaining are bottlenecked on the MS. This, combined with the fact that Haswell's division is 36 uops and evenly spreadp0 p1 p5 p6
(All of which have integer ALUs and of which p6 is a port for predicted-taken branches), makes me think that internally, division executes a high-radix, 4-uop/iteration loop producing ~8 bits at a time of the quotient. – Iwillnotexist Idonotexistrep movs
startup) aren't subject to dynamic branch prediction by the usual branch-prediction hardware (and this is why it has such high startup overhead even when used repeatedly, as Andy Glew (designer of the original P6 rep-string implementation) explained). They don't mispredict AFAIK, so maybe microcode branches are special and aren't speculatively executed? Obviously they can loop efficiently, though. – Peter Cordes