1
votes

I am analyzing Agner Fog's "Optimizing subroutines in assembly language: An optimization guide for x86 platforms". In particular, I am trying to understand chapter 12.7, and there is one point I cannot understand. The author writes:

Instruction decoding in the PM processor follows the 4-1-1 pattern. The pattern of (fused) μops for each instruction in the loop in example 12.6b is 2-2-2-2-2-1-1-1. This is not optimal, and it will take 6 clock cycles to decode. This is more than the retirement time, so we can conclude that instruction decoding is the bottleneck in example 12.6b. The total execution time is 6 clock cycles per iteration or 3 clock cycles per calculated Y[i] value.

  1. What does it mean that instruction decoding follows the 4-1-1 pattern, and how can one know this?
  2. The pattern for the loop is 2-2-2-2-2-1-1-1. OK, but why does it take 6 clock cycles to decode?

1 Answer

4
votes
  1. The CPU's front end can decode multiple (macro) instructions in one clock cycle. Each macro instruction decodes to 1 or more micro-ops (μops). The 4-1-1 pattern means that the first of the three parallel decoders can handle a complex instruction that decodes to up to 4 μops, while the second and third decoders can only handle instructions that decode to 1 μop each. If the next instruction in line needs more μops than a decoder can produce, that decoder does not consume it, and the instruction waits for the first decoder in the next cycle.

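  2. The 5 instructions that decode to 2 μops must each be consumed by the first decoder, so at most one of them is decoded per cycle. Only the tail of 1-μop instructions allows some parallelism. In the diagram, ^ marks a decoder that consumes an instruction and x marks a decoder that stays idle.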

    2 2 2 2 2 1 1 1 (Macro-instruction stream, μops per instruction)
    ^ x x
    4 1 1  (Decode cycle 0)
    
    . 2 2 2 2 1 1 1
      ^ x x
      4 1 1  (Decode cycle 1)
    
    . . 2 2 2 1 1 1
        ^ x x
        4 1 1  (Decode cycle 2)
    
    . . . 2 2 1 1 1
          ^ x x
          4 1 1  (Decode cycle 3)
    
    . . . . 2 1 1 1
            ^ ^ ^
            4 1 1  (Decode cycle 4)
    
    . . . . . . . 1
                  ^ x x
                  4 1 1  (Decode cycle 5)
    
    . . . . . . . . (Instruction stream fully consumed)
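
The walk-through above can also be checked with a small simulation. This is only a minimal sketch of the simplified model used in this answer: instructions are decoded strictly in order, and a decoder stalls the rest of the cycle when the next instruction needs more μops than it can produce. The function name `decode_cycles` and its parameters are made up for illustration, and the model ignores everything else that affects a real front end (fetch boundaries, instruction lengths, and so on).

```python
def decode_cycles(uops_per_instr, decoders=(4, 1, 1)):
    """Count decode cycles under a simplified in-order decoder model.

    uops_per_instr: μops produced by each instruction, in program order.
    decoders: maximum μops each parallel decoder can emit (assumed 4-1-1).
    """
    cycles = 0
    i = 0  # index of the next instruction waiting to be decoded
    n = len(uops_per_instr)
    while i < n:
        if uops_per_instr[i] > decoders[0]:
            # Instructions beyond the first decoder's capacity are outside
            # the scope of this simplified model.
            raise ValueError("instruction too complex for this model")
        # Offer instructions to the decoders in order; stop at the first
        # decoder that cannot handle the instruction it is given.
        for capacity in decoders:
            if i >= n or uops_per_instr[i] > capacity:
                break
            i += 1
        cycles += 1
    return cycles


# The loop from example 12.6b, as μops per instruction.
print(decode_cycles([2, 2, 2, 2, 2, 1, 1, 1]))  # 6
```

Under the same model, a hypothetical ordering such as 2-1-1-2-1-2-2-2 would decode in 5 cycles, which illustrates why the guide calls the original pattern "not optimal". Whether such a reordering is actually possible depends on the data dependencies in the real loop, which this sketch does not consider.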