I was looking at Agner Fog's instruction tables here, specifically at the Sandy Bridge section, and one thing caught my attention. If you look at the DIV instructions you can see that, for example, DIV r64 can decode into as many as 56 uops! My question is: is that true, or have I misinterpreted the table?
I can't get my head around this. I always thought that an integer division of two registers decoded into just 1 uop, and that this uop was dispatched to port 0 (on Sandy Bridge, for example).
What I thought happened is this: the uop is dispatched to port 0 and finishes some cycles later, but, thanks to pipelining, one div uop (or another uop that needs port 0) can be sent to that port every cycle. This has completely broken my mental model: 56 different uops that need to be dispatched in 56 different cycles and that occupy 56 ROB entries, just to do ONE integer division?
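To put rough numbers on what worries me, here is a back-of-the-envelope sketch in Python. The 56-uop figure is my reading of the tables. The 168-entry ROB size is a value I am assuming for Sandy Bridge, and the "one uop per cycle through a single port" model is my own guess from above, not something the tables state. The function names are purely illustrative.

```python
import math

# Figure as I read it in the instruction tables for DIV r64 on Sandy Bridge.
DIV_R64_UOPS = 56

# Assumption (not from the tables): reorder buffer size for Sandy Bridge.
ROB_ENTRIES = 168


def dispatch_cycles(uops, ports=1):
    """Cycles needed to dispatch `uops` if each of `ports` ports accepts
    one uop per cycle. ports=1 models my guess that everything goes
    through port 0."""
    return math.ceil(uops / ports)


def divisions_in_flight(uops, rob_entries=ROB_ENTRIES):
    """Whole divisions that fit in the ROB if each one holds `uops` entries."""
    return rob_entries // uops


print(dispatch_cycles(DIV_R64_UOPS))      # 56
print(divisions_in_flight(DIV_R64_UOPS))  # 3
```

If those assumptions hold, a single division would tie up one port for 56 cycles and a third of the ROB, which is what seems so surprising to me.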