24
votes

I am working through Kip Irvine's "Assembly Language for x86 Processors, sixth edition" and am really enjoying it.

I have just read about the NOP mnemonic in the following paragraph:

"It [NOP] is sometimes used by compilers and assemblers to align code to 
 even-address boundaries."

The example given is:

00000000   66 8B C3   mov ax, bx
00000003   90         nop
00000004   8B D1      mov edx, ecx

The book then states:

"x86 processors are designed to load code and data more quickly from even 
 doubleword addresses."

My question is: is the reason for this that, for the 32-bit x86 processors the book refers to, the CPU's word size is 32 bits, and it can therefore pull in the instructions (NOP included) and process them in one go? If that is the case, I am assuming that a 64-bit processor, with a word size of a quadword, would do the same with a hypothetical 5 bytes of code plus a NOP?

Lastly, after I write my code, should I go through and correct alignment with NOPs to optimize it, or will the assembler (MASM, in my case) do this for me, as the text seems to imply?
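For what it's worth, here's how I picture the padding arithmetic from the book's example, as a small Python sketch (`pad_to` is just a name I made up for illustration):

```python
def pad_to(addr: int, align: int) -> int:
    """Bytes of NOP padding needed to carry `addr` up to the
    next `align`-byte boundary (0 if already aligned)."""
    return (-addr) % align

# The book's example: `mov ax, bx` ends at offset 3, so one NOP (90h)
# carries the next instruction to the dword boundary at offset 4.
print(pad_to(3, 4))   # -> 1
print(pad_to(4, 4))   # -> 0 (already aligned, no padding needed)
```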

Thanks,

Scott

3
Everything you want to know about modern processors' architecture is at agner.org/optimize. The required alignment for instructions is independent of the word size, and it is 16 bytes for modern Intel processors. I don't want to ruin your fun, but you shouldn't trust a book that makes blanket statements about performance for "x86 processors". Each single model has different characteristics. – Pascal Cuoq
Thanks for your comment! You haven't ruined my fun - the joy is in the learning and I just learned some more from you! Will check the website out, too. – Scott Davies
This book looks horribly outdated. 16-bit x86 is really ancient; TBH I don't see the value in teaching this stuff even for educational purposes. Maybe as a counter-example of how not to design a processor/assembly language. – Gunther Piez

3 Answers

23
votes

Code that's executed on word (for 8086) or DWORD (80386 and later) boundaries executes faster because the processor fetches whole (D)words. So if your instructions aren't aligned then there is a stall when loading.

However, you can't dword-align every instruction. Well, I guess you could, but then you'd be wasting space and the processor would have to execute the NOP instructions, which would kill any performance benefit of aligning the instructions.

In practice, aligning code on dword (or whatever) boundaries only helps when the instruction is the target of a branching instruction, and compilers typically will align the first instruction of a function, but won't align branch targets that can also be reached by fall through. For example:

MyFunction:
    cmp ax, bx
    jnz NotEqual
    ; ... some code here
NotEqual:
    ; ... more stuff here

A compiler that generates this code will typically align MyFunction because it is a branch target (reached by call), but it won't align the NotEqual because doing so would insert NOP instructions that would have to be executed when falling through. That increases code size and makes the fall-through case slower.

I would suggest that if you're just learning assembly language, you don't worry about things like this, which will most often give you only marginal performance gains. Just write your code to make things work. After it works, you can profile it and, if you think it's necessary after looking at the profile data, align your functions.

The assembler typically won't do it for you automatically.
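To make the trade-off concrete, here is a small Python sketch (the helper names are made up) counting the NOP bytes you would insert, and execute on the fall-through path, if you forced every instruction onto a dword boundary:

```python
def pad_to(addr, align):
    # NOP bytes needed to carry `addr` up to the next `align`-byte boundary
    return (-addr) % align

def nops_if_every_insn_aligned(sizes, align=4, start=0):
    """Total NOP bytes inserted (all executed when falling through)
    if every instruction were forced onto an `align`-byte boundary."""
    addr, nops = start, 0
    for size in sizes:
        pad = pad_to(addr, align)   # padding in front of this instruction
        nops += pad
        addr += pad + size
    return nops

# Instruction sizes as in the book's snippet (3-byte mov, 2-byte mov),
# then extended with a few more made-up sizes:
print(nops_if_every_insn_aligned([3, 2]))            # 1 NOP, as in the book
print(nops_if_every_insn_aligned([3, 1, 2, 5, 2]))   # 9 NOP bytes for 13 bytes of code
```

The second figure is the point: the padding quickly rivals the size of the real code, which is why only branch targets are worth aligning.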

7
votes

Because the (16-bit) processor can fetch values from memory only at even addresses, due to the memory's particular layout: it is divided into two banks, each one byte wide, with half of the data bus connected to the first bank and the other half to the second. Since the banks are aligned (as in my picture), the processor can only fetch values that are on the same "row".

  bank 1   bank 2
+--------+--------+
|  8 bit | 8 bit  |
+--------+--------+
|        |        |
+--------+--------+
| 4      | 5      | <-- the CPU can fetch only values on the same "row"
+--------+--------+
| 2      | 3      |
+--------+--------+
| 0      | 1      |
+--------+--------+
 \      / \      /
  |    |   |    |
  |    |   |    |

 data bus  (to uP)

Now, because of this fetch limitation, if the CPU is forced to fetch a value located at an odd address (say 3), it has to fetch the bytes at 2 and 3, then the bytes at 4 and 5, throw away bytes 2 and 5, and join bytes 3 and 4 (you are talking about x86, which has a little-endian memory layout).
That's why it's better to have code (and data!) on even addresses.

PS: On 32-bit processors, code and data should be aligned on addresses divisible by 4 (since there are 4 banks).
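If it helps, the fetch rule above can be modeled in a few lines of Python (`bus_cycles` is just an illustrative name): each memory cycle fetches one full "row", so a read that straddles a row boundary costs an extra cycle.

```python
def bus_cycles(addr, nbytes, bus_width=2):
    """Memory cycles a `bus_width`-byte-wide bus needs to read
    `nbytes` bytes starting at `addr` (one full row per cycle)."""
    first_row = addr // bus_width
    last_row = (addr + nbytes - 1) // bus_width
    return last_row - first_row + 1

print(bus_cycles(2, 2))        # even address: one cycle
print(bus_cycles(3, 2))        # odd address: two cycles
print(bus_cycles(2, 4, 4))     # misaligned dword on a 32-bit bus: two cycles
```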

Hope I was clear. :)

2
votes

The problem is not limited only to instruction fetches. And it is unfortunate that programmers are not made aware of this early and punished for it often. The x86 architecture has made folks lazy. It makes it difficult when transitioning to other architectures.

It has everything to do with the nature of the data bus. When you have, for example, a 32-bit-wide data bus, a read from memory is aligned on that boundary; the lower two address bits are normally ignored, as they have no meaning to the bus. So if you perform a 32-bit read from address 0x02, whether as part of an instruction fetch or a data read, two memory cycles are required: a read from address 0x00 to get two of the bytes and a read from 0x04 to get the other two. That takes twice as long, and stalls the pipeline if it is an instruction fetch.

The performance hit is dramatic, and alignment is in no way a wasted optimization for data reads. Programs that align their data on natural boundaries, and size structures and other items in integer multiples of the bus width, can see as much as double the performance without any other effort. Similarly, using an int instead of a char for a variable can be faster, even if it is only ever going to count to 10.

It is true that adding NOPs to programs to align branch destinations is usually not worth the effort. Unfortunately, x86 is variable word length and byte based, so you constantly suffer these inefficiencies. If you are painted into a corner and need to squeeze a few more clocks out of a loop, you should align not only on a boundary that matches the bus size (these days 32 or 64 bits) but also on a cache line boundary, and try to keep that loop within one or maybe two cache lines. On that note, a single random NOP in a program can shift where the cache line boundaries fall, and a performance change can be detected if the program is large enough and has enough functions or loops.

Same story with caches: say you have a branch target at address 0xFFFC. If it is not in the cache, a cache line has to be fetched, which is nothing unexpected, but one or two instructions later (four bytes) another cache line is required. If the target had been 0x10000 then, depending on the size of your function, you might have pulled this off in one cache line.
If this is an often-called function, and another often-called function sits at a similar enough address that the two evict each other, you will run twice as slow. This is one place where x86 helps, though: with variable instruction length you can pack more code into a cache line than on other widely used architectures.
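The 0xFFFC vs 0x10000 example above can be checked with a quick Python sketch (assuming 64-byte cache lines; `cache_lines_touched` is an illustrative helper, and the 8-byte span stands in for "a couple of instructions"):

```python
def cache_lines_touched(addr, nbytes, line=64):
    """How many `line`-byte cache lines the range [addr, addr+nbytes) spans."""
    return (addr + nbytes - 1) // line - addr // line + 1

# A branch target at 0xFFFC crosses into a second cache line after
# only 4 bytes; the same code at 0x10000 fits entirely in one line.
print(cache_lines_touched(0xFFFC, 8))    # -> 2
print(cache_lines_touched(0x10000, 8))   # -> 1
```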

With x86 and instruction fetches you can't really win. At this point it is often futile to hand-tune x86 programs (from an instruction perspective): given the number of different cores and their nuances, you can make gains on one processor on one computer one day, but that same code will make other x86 processors on other computers run slower, sometimes at less than half the speed. It is better to be generically efficient, and accept a bit of sloppiness, so the code runs okay on all computers every day. Data alignment will show improvement across processors and computers, but instruction alignment won't.