Consider the following loop:
.loop:
add rsi, OFFSET
mov eax, dword [rsi]
dec ebp
jg .loop
where OFFSET
is some non-negative integer and rsi
contains a pointer to a buffer defined in the bss
section. This loop is the only loop in the code. That is, it's not being initialized or touched before the loop. Presumably, on Linux, all of the 4K virtual pages of the buffer will be mapped on-demand to the same physical page. Therefore, the only limit on the buffer size is the number of virtual pages. So we can easily experiment with very large buffers.
The loop consists of 4 instructions. Each instruction is decoded into a single uop in the fused and unfused domain on Haswell. There is also a loop-carried dependency between the successive instances of add rsi, OFFSET
. Therefore, under idle conditions where the load always hit in the L1D, the loop should execute at about 1 cycle per iteration. For small offsets (strides), this is expected thanks to the IP-based L1 streaming prefetcher and the L2 streaming prefetcher. However, both prefetchers can only prefetch within a 4K page and the maximum stride supported by the L1 prefetcher is 2K. So for small strides, there should be about 1 L1 miss per 4K page. As the stride increases, the total number of L1 misses and TLB misses will increase and performance will deteriorate accordingly.
The following graph shows various interesting performance counters (per iteration) for strides between 0 and 128. Note that the number of iterations is constant for all experiments. Only the buffer size changes to accommodate the specified stride. In addition, only user-mode performance events are counted.
The only weird thing here is the number of retired uops is increasing with the stride. It goes from 3 uops per iteration (as expected) to 11 for stride 128. Why is that?
Things only get weirder with larger strides, as the following graph shows. In this graph, the strides range from 32 to 8192 with 32-byte increments. First, the number of retired instructions increases linearly from 4 to 5 at stride 4096 bytes after which it remains constant. The number of load uops increases from 1 to 3 and the number of L1D load hits remains 1 per iteration. Only the number of L1D load misses makes sense to me for all strides.
The two obvious effects of larger strides are:
- The execution time increases and so more hardware interrupts will occur. However, I'm counting user-mode events, so interrupts should not interfere with my measurements. I've also repeated all experiments with
taskset
ornice
and got the same results. - The number of page walks and page faults increases. (I've verified this but I'll omit the graphs for brevity.) Page faults are handled by the kernel in kernel-mode. According to this answer, page walks are implemented using dedicated hardware (on Haswell?). Although the link that the answer is based on is dead.
To investigate further, the following graph shows the number of uops from microcode assists. The number of microcode assist uops per iteration increases until it reaches the maximum value at stride 4096, just like with the other performance events. The number of microcode assist uops per 4K virtual page is 506 for all strides. The "Extra UOPS" line plots the number of retired uops minus 3 (the expected number of uops per iteration).
The graph shows that the number of extra uops is slightly larger than half of the number of microcode assist uops for all strides. I don't know what this means, but it could be related to page walks and could be the reason for the observed perturbation.
Why are the numbers of retired instructions and uops per iteration increasing for larger strides even though the number of static instructions per iteration is the same? Where is the interference coming from?
The following graphs plot the number of cycles per iteration against the number of retired uops per iteration for different strides. The number of cycles increases much more quickly than the number of retired uops. By using linear regression, I found:
cycles = 0.1773 * stride + 0.8521
uops = 0.0672 * stride + 2.9277
Taking the derivatives of both functions:
d(cycles)/d(stride) = 0.1773
d(uops)/d(stride) = 0.0672
This means that the number of cycles increase by 0.1773 and the number of retired uops increase by 0.0672 with each 1 byte increment in stride. If interrupts and page faults were indeed the (only) cause of perturbation, shouldn't both rates be very close?
p0156
uops). So basically when a load feeds into a load only a load will be replayed since it's the only dependent op. If you have ALU ops after, the ALU ops will be replayed. Sometimes more than one uop is replayed including not-directly-dependent ones, it seems uops that would execute within one cycle of the load are replayed. – BeeOnRope