I am studying Agner Fog's "Optimizing subroutines in assembly language: An optimization guide for x86 platforms", in particular chapter 12.7, and there is one point I cannot understand. The author writes:
> Instruction decoding in the PM processor follows the 4-1-1 pattern. The pattern of (fused) μops for each instruction in the loop in example 12.6b is 2-2-2-2-2-1-1-1. This is not optimal, and it will take 6 clock cycles to decode. This is more than the retirement time, so we can conclude that instruction decoding is the bottleneck in example 12.6b. The total execution time is 6 clock cycles per iteration or 3 clock cycles per calculated Y[i] value.
- What does it mean that instruction decoding follows the 4-1-1 pattern, and how can I know that a given processor does so?
- The pattern for the loop is 2-2-2-2-2-1-1-1. Why does that take 6 clock cycles to decode?
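To make the second question concrete, here is a small Python sketch of my current guess at what "4-1-1" means. It rests on assumptions that are mine, not statements from the guide: there are three decoders working in parallel each clock cycle, the first can handle an instruction of up to 4 μops, the other two can only handle single-μop instructions, and instructions must be decoded in program order. The function name `decode_cycles` and its parameters are hypothetical.

```python
def decode_cycles(uops_per_instr, pattern=(4, 1, 1)):
    """Count clock cycles needed to decode a sequence of instructions.

    uops_per_instr: number of (fused) uops for each instruction, in order.
    pattern: assumed per-cycle capacity of each decoder, in decoder order.
    This is a simplified model, not a description of the real hardware.
    """
    # Assumption: anything larger than the first decoder's capacity is
    # outside this model (real CPUs handle such instructions differently).
    if any(u > pattern[0] for u in uops_per_instr):
        raise ValueError("instruction too large for this simplified model")

    cycles = 0
    i = 0
    n = len(uops_per_instr)
    while i < n:
        cycles += 1
        # Fill the decoders in order; stop at the first instruction
        # that does not fit in the next decoder.
        for capacity in pattern:
            if i < n and uops_per_instr[i] <= capacity:
                i += 1
            else:
                break
    return cycles


print(decode_cycles([2, 2, 2, 2, 2, 1, 1, 1]))  # 6
```

Under this model every 2-μop instruction has to wait for the first decoder, so the five 2-μop instructions start five separate cycles, and the last 1-μop instruction spills into a sixth. If that reading is right, it would explain the 6 cycles, but I would like confirmation that this is really how the decoders behave.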