4
votes

I've been working on an Intel 8086 emulator for about a month now. I've decided to start counting cycles to make emulation more accurate and synchronize it correctly with the PIT.
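
For context, one common approach (a minimal sketch with hypothetical names, not a description of any particular emulator) is to charge each executed instruction its cycle cost and drain the accumulated cycles into PIT ticks. On the IBM PC both clocks derive from the same 14.318 MHz crystal (CPU = crystal/3, about 4.77 MHz; PIT = crystal/12, about 1.193 MHz), so one PIT tick corresponds to exactly 4 CPU cycles:

```c
/* Minimal sketch (hypothetical names): accumulate executed-instruction cycles
 * and convert them to PIT ticks. On the IBM PC, CPU clock = crystal/3 and
 * PIT clock = crystal/12, so one PIT tick elapses every 4 CPU cycles. */
#include <stdint.h>

#define CPU_CYCLES_PER_PIT_TICK 4

static uint64_t pending_cycles;   /* CPU cycles not yet turned into PIT ticks */
static uint64_t pit_ticks;        /* stand-in for the real PIT counter logic  */

static void pit_tick(void)        /* stub: a real emulator would step channel 0 here */
{
    pit_ticks++;
}

/* Call once per executed instruction with its cycle cost from the timing tables. */
void charge_cycles(unsigned cycles)
{
    pending_cycles += cycles;
    while (pending_cycles >= CPU_CYCLES_PER_PIT_TICK) {
        pending_cycles -= CPU_CYCLES_PER_PIT_TICK;
        pit_tick();
    }
}
```

The exact 4:1 ratio only holds for original PC/XT-class hardware driven by that shared crystal; a different clock arrangement would change the constant.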

The clock cycles used for each instruction are detailed in Intel's User Manual, but I'd like to know how they're calculated. For example, I've deduced the following steps for the XCHG mem8,reg8 instruction, which takes exactly 17 clock cycles according to the manual (summed up in the sketch after the list):

  1. decode the second byte of the instruction: +1 cycle;
  2. transfer first operand from memory into a temporary location: +7 cycles;
  3. transfer second operand from register into memory destination: +8 cycles;
  4. transfer first operand from temporary location into register destination: +1 cycle.
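
Written out as code, my deduction amounts to this (the per-step costs are just my guesses from the list above; only the 17-cycle total comes from the manual):

```c
/* Illustration of my guessed breakdown of XCHG mem8,reg8.
 * Only the 17-cycle total is documented; the per-step split is speculative. */
unsigned xchg_mem8_reg8_cycles(void)
{
    unsigned cycles = 0;
    cycles += 1;    /* decode the second (ModRM) byte of the instruction    */
    cycles += 7;    /* read the memory operand into a temporary location    */
    cycles += 8;    /* write the register operand to the memory destination */
    cycles += 1;    /* move the temporary into the register destination     */
    return cycles;  /* 1 + 7 + 8 + 1 = 17 */
}
```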

But I'm probably completely wrong as my reasoning doesn't seem to work for all instructions. For instance, I can't comprehend why the PUSH reg instruction takes 11 clock cycles, whereas the POP reg instruction only takes 8 clock cycles.

So, could you tell me how clock cycles are spent in each instruction, or rather point me to a general method for understanding where those numbers come from?

Thank you.

@downvoter Could you tell me what was so wrong with my question that you had to downvote me? - neat
check "8086 tiny" sources. they might be interesting to you - Alexander Zhak
PUSH is basically a MOV from register to memory. POP is a MOV from memory to register. From the tables, the former is 9+EA, the latter 8+EA. Since you can POP with 0 EA (the stack pointer already points to where you will POP from), the read can start immediately, and the stack pointer increment can (I guess) overlap the read cycle once the address is no longer needed. For the PUSH operation there is 2 EA, since the stack pointer must be decremented before issuing the MOV. I would suppose this is where the extra cycles come from. This is only speculation; I don't know this for certain. - J...
@J... I've read the manual linked in my OP. Besides the tables, it doesn't give much more information. I've also read this one but the explanations (page 107) do not match my observations nor the first one. - neat
The book you want is Michael Abrash's "Zen of Assembly Language," which is long out of print but still available on the used market. As I recall, you can't easily break those timings into sub-operations. In addition, there are hidden costs such as instruction prefetch and DMA refresh (admittedly platform specific) that you have to take into account. The official instruction timings tell you what's happening on the CPU but they're "best case," assuming that the supporting hardware doesn't add anything. - Jim Mischel

2 Answers

4
votes

How cycles are calculated and what the clock actually does was a mystery to me as well, until I had the chance to work with hardware engineers and could see what kind of models they work with. The answer lies in the hardware.

A CPU is a parallel machine. Although its design is usually described to programmers in simplifying terms (the pipeline, the microinstructions needed to implement it, and so on), it remains a parallel machine.

For an instruction to complete, many tiny bit-sized signals must flow from one end to the other. At some points the processing units must wait until all the input bits arrive. This coordinated movement from one stage to another is driven by the clock signal, which is sent centrally to all the many parts. Each such move, drummed by the clock signal, is called a cycle.

So in order to know how many cycles are really needed to finish the work, you must take into account how the wires are connected, where the bits must flow, and where the required synchronization points are and how many of them there are.
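
As a toy illustration of that idea (this is not the 8086's actual datapath, just a model of clocked stages), data advances one latch per clock edge, so a result whose bits must traverse N stages cannot appear in fewer than N cycles:

```c
/* Toy model of clocked synchronization: each clock edge moves data one stage,
 * so a value fed in at the first stage needs STAGES cycles to reach the last.
 * This illustrates the idea above, not the 8086's real datapath. */
#include <stdio.h>
#include <stdint.h>

#define STAGES 3

static uint16_t stage[STAGES];            /* latches between processing steps */

static void clock_edge(uint16_t input)    /* one cycle: every latch captures its predecessor */
{
    for (int i = STAGES - 1; i > 0; i--)
        stage[i] = stage[i - 1];
    stage[0] = input;
}

int main(void)
{
    for (int cycle = 1; cycle <= STAGES; cycle++) {
        clock_edge(cycle == 1 ? 0x1234 : 0);
        printf("cycle %d: last stage = 0x%04x\n", cycle, stage[STAGES - 1]);
    }
    return 0;                             /* the value only emerges on the final cycle */
}
```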


I doubt the Intel 8086 schematic is publicly available, and even if it were, I doubt it would be readable. But the only correct answer lies there. Everything else is just a simplification, and to reproduce the exact hardware behavior in software you would have to simulate/interpret the CPU's hardware.


2
votes

The question is quite broad so I will only address the PUSH vs POP question here.

PUSH is basically a MOV from register to memory (plus a stack pointer decrement beforehand). POP is a MOV from memory to register (plus a stack pointer increment afterwards).

If you look at page 2-61 you will find the MOV timings:

MOV register, memory: 8+EA clocks, 1 transfer, 2-4 bytes (coding example: MOV BP, STACK_TOP)

MOV memory, register: 9+EA clocks, 1 transfer, 2-4 bytes (coding example: MOV COUNT [DI], CX)

For the POP operation, you already have the stack pointer in a register, so the effective address (EA) cost is zero. You can perform the MOV immediately, and I can only assume that the special POP operation increments the stack pointer at the same time, somewhere in the later clock cycles of the read operation, once the address is no longer needed.

For the PUSH operation you have an EA of 2, since the stack pointer must be decremented before obtaining the required address to perform the write. There can be no concurrency leveraged here, so you have the 9 cycles for the MOV plus, seemingly, two for the effective address calculation (the stack pointer decrement), matching the 11 cycles in the manual.
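
Put as arithmetic (following this answer's reasoning, which is a plausible reading of the tables rather than Intel's documented breakdown), the manual's totals fall out like this:

```c
/* Sanity check of the reasoning above (assumed split, not Intel-documented):
 * PUSH pays the memory-destination MOV cost plus an up-front SP adjustment,
 * POP pays the register-destination MOV cost with the SP update overlapped. */
#include <assert.h>

enum {
    MOV_MEM_REG_CLOCKS = 9,  /* MOV memory, register (before any EA)          */
    MOV_REG_MEM_CLOCKS = 8,  /* MOV register, memory (before any EA)          */
    SP_ADJUST_CLOCKS   = 2   /* assumed cost of adjusting SP before the write */
};

int main(void)
{
    assert(MOV_MEM_REG_CLOCKS + SP_ADJUST_CLOCKS == 11);  /* PUSH reg in the manual */
    assert(MOV_REG_MEM_CLOCKS + 0                ==  8);  /* POP reg in the manual  */
    return 0;
}
```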