To understand why Bulldozer was subpar, I've been reading Agner Fog's excellent microarchitecture book. On page 178, under Bulldozer, it has this paragraph:
Instructions with up to three prefixes can be decoded in one clock cycle. There is a very large penalty for instructions with more than three prefixes. Instructions with 4-7 prefixes take 14-15 clock cycles extra to decode. Instructions with 8-11 prefixes take 20-22 clock cycles extra, and instructions with 12-14 prefixes take 27 - 28 clock cycles extra. It is therefore not recommended to make NOP instructions longer with more than three prefixes. The prefix count for this rule includes operand size, address size, segment, repeat, lock, REX and XOP prefixes. A three-bytes VEX prefix counts as one, while a two-bytes VEX prefix does not count. Escape codes (0F, 0F38, 0F3A) do not count.
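To make the tiers in that paragraph easier to see at a glance, here is a minimal Python sketch that maps a prefix count to the extra decode cycles it quotes. The function name `extra_decode_cycles` is made up for illustration, and the upper bound of 14 is an assumption based on x86 instructions being at most 15 bytes long; the numbers themselves are just the ones from the quote, not measurements.

```python
def extra_decode_cycles(prefix_count):
    """Return (min, max) extra decode cycles on Bulldozer for a given
    number of counted prefixes, using the tiers from the quoted paragraph."""
    # Assumption: an x86 instruction is at most 15 bytes, so at most
    # 14 one-byte prefixes can precede a one-byte opcode.
    if prefix_count < 0 or prefix_count > 14:
        raise ValueError("prefix_count must be between 0 and 14")
    if prefix_count <= 3:
        return (0, 0)      # decoded in one clock cycle, no penalty
    if prefix_count <= 7:
        return (14, 15)
    if prefix_count <= 11:
        return (20, 22)
    return (27, 28)        # 12-14 prefixes


for n in (3, 4, 8, 12):
    print(n, extra_decode_cycles(n))
```

So going from three prefixes to four is where the cost appears: `extra_decode_cycles(3)` gives `(0, 0)` while `extra_decode_cycles(4)` gives `(14, 15)`.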
When I searched for prefixes, I was hit with very technical definitions far beyond my abilities, or with suggestions that they are limited to 4 per instruction, which conflicts with the extract above.
So in simple terms, can someone explain what prefixes are, what they do, and why you might want to tack 14 or more onto one instruction instead of breaking it up?
NOP. One NOP takes the same time to execute regardless of length, other than code-size side-effects and frontend issues. (As Agner Fog's guide explains). You definitely don't want 14 NOPs wasting space in the uop cache on a CPU that uses a uop-cache. Other than that, well, the x32 ABI often uses address-size prefixes (so base+index*scale addressing modes don't accidentally go outside the 32bit address range). so lock inc word [edi + r10d*4] would need 4: lock op-sz addr-sz REX. – Peter Cordes
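To see what those four prefixes look like as bytes, here is a rough Python sketch that scans the start of a 64-bit-mode instruction and names each prefix byte. The byte string is a hand-assembled encoding of lock inc word [edi + r10d*4] and is an assumption, not assembler output: a real assembler may emit the legacy prefixes in a different order. The helper `leading_prefixes` is hypothetical, and it only handles legacy and REX prefixes, not VEX or XOP.

```python
# One-byte legacy prefixes and the role each one plays.
LEGACY_PREFIXES = {
    0xF0: "lock",
    0xF2: "repne", 0xF3: "rep",
    0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x26: "es", 0x64: "fs", 0x65: "gs",
    0x66: "operand-size",
    0x67: "address-size",
}


def leading_prefixes(code):
    """Name the prefix bytes at the start of one 64-bit-mode instruction."""
    names = []
    for byte in code:
        if byte in LEGACY_PREFIXES:
            names.append(LEGACY_PREFIXES[byte])
        elif 0x40 <= byte <= 0x4F:
            # In 64-bit mode, 0x40-0x4F is a REX prefix; it must come
            # immediately before the opcode, so stop scanning after it.
            names.append("REX")
            break
        else:
            break  # first non-prefix byte: the opcode starts here
    return names


# Assumed encoding: F0 lock, 66 operand-size, 67 address-size,
# 42 REX.X (index is r10), FF /0 inc, ModRM 04, SIB 97.
ENCODING = bytes.fromhex("f0 66 67 42 ff 04 97")
print(leading_prefixes(ENCODING))
# prints ['lock', 'operand-size', 'address-size', 'REX']
```

Each prefix here does a separate job: lock makes the read-modify-write atomic, 66 selects the 16-bit operand, 67 selects 32-bit addressing, and REX is needed to reach r10. That is how a normal instruction can end up with four prefixes without any padding.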