How can one figure out if a loop is being entered with a 16 byte aligned address in x86-64 assembly?

Question

I'm a beginner when it comes to x86-64 and am trying to get better, particularly with regard to performance optimization.

I have read through parts of Agner Fog's optimization manual, volume 2. It repeatedly states how important it is to enter a critical hotspot/loop with 16-byte alignment. Now I am having trouble figuring out whether an entry into a loop is 16-byte aligned or not.

Are you supposed to add up the byte cost of every instruction in the subroutine before the loop entry and see if it is divisible by 16? I have consulted the Intel Developer Manual for x86-64 and am having trouble working out from it which instructions have which byte lengths. Is an instruction's byte size simply the opcode bytes added up? So in the case of MOV r64/m16 with opcode REX.W + 8C, would the size be 2 bytes (one for the REX.W prefix and one for 8C)?

Consider the following code, assume some string is passed as parameter in rdi which is to be manipulated in .LmanipulationLoop:

string_fun:
   cmp cl, byte ptr [rdi]
   jz .Lend
   xor rcx, rcx

.LmanipulationLoop
  *some string operation*

.Lend
  ret

So going based off of my current understanding:

cmp cl, byte ptr [rdi], opcode for this is 0x38 (CMP r/m8, r8) so 1 byte
jz .Lend, opcode for this is 0x0F 84 (jz rel32) so 2 bytes (I am unsure about this being the right opcode)
xor rcx, rcx, opcode for this is REX.W + 0x33 (xor r64, r/m64) so 2 bytes

All in all that makes (assuming I'm right) 5 bytes. Does this now mean I would need 11 NOPs before .LmanipulationLoop to ensure an aligned entry into the loop?

Normally you'd use .p2align 4 before the label to get the assembler to figure out how much padding is needed, and emit one or more long NOPs (not 11 single-byte NOPs, that would be terrible). — Peter Cordes
Okay I see. And this would happen at assembly time, right? And if I suspect some misalignment happening somewhere you'd just need to use some other method with more metrics to find out? — Liqs

Peter Cordes · Accepted Answer · 2020-05-15T15:45:18

You don't need to do this manually; assemblers can do it for you. Manual calculation is only useful if you want to be more clever than just padding with NOPs to align something right after the point where you insert padding.

Normally you'd use .p2align 4 (GAS) or align 16 (NASM¹) before the label to get the assembler to figure out how much padding is needed, and emit one or more long NOPs. (Not 11 single-byte NOPs, that would be terrible because they'd each have to decode separately).

And/or use a debugger or disassembler to check the label address instead of computing it manually, if you're aiming for the techniques discussed in "What methods can be used to efficiently extend instruction length on modern x86?" (lengthening instructions instead of padding with NOPs).

It's useful to know something about which instructions are what length if you're trying to minimize the number of NOPs needed, but this is one case where some trial/error is fine to find a good sequence of instructions that leaves you needing at most one long NOP.
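The padding arithmetic itself is small enough to sketch. The Python snippet below is only an illustration (the helper name `padding_needed` is made up for this example and is not part of any assembler): it computes how many bytes are needed to bring a byte offset up to the next power-of-2 boundary, which is the gap `.p2align` has to fill with NOPs.

```python
def padding_needed(offset: int, p2: int = 4) -> int:
    """Bytes of padding needed to advance `offset` to the next 2**p2 boundary.

    Hypothetical helper for illustration; it mirrors the arithmetic behind
    `.p2align p2`, not any assembler's actual implementation.
    """
    alignment = 1 << p2
    # Python's % always returns a non-negative result for a positive modulus,
    # so negating the offset gives the distance up to the next boundary.
    return -offset % alignment


if __name__ == "__main__":
    print(padding_needed(5))     # 5 bytes before the label, as estimated in the question
    print(padding_needed(0xD))   # 0xd bytes before the label
    print(padding_needed(0x10))  # already on a 16-byte boundary
```

With the question's estimate of 5 bytes this gives 11, matching the "11 NOPs" figure; the point of the answer is that the assembler should fill that gap with as few long NOPs as possible rather than 11 separate ones.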

Aligning loop tops is not always necessary on CPUs with a uop cache

What usually actually matters are 32-byte boundaries for uop cache lines. Or not at all for most small loops on CPUs that have a loop buffer (but note that Skylake / Kaby Lake's LSD is disabled by microcode updates to fix an erratum). 32-byte alignment of the top of a very critical loop could be useful if it avoids a front-end bottleneck fetching from the uop cache. Or for tiny loops that can run at 1 cycle per iteration, having the whole loop in the same uop cache line is essential (otherwise the front-end takes two cycles per iteration to fetch it).

Unfortunately, the major loop-alignment issue on Skylake-derived CPUs concerns the bottom of the loop: you want to place it so as to work around a performance pothole where a jcc or macro-fused compare+branch that touches a 32-byte boundary disables the uop cache for that line.
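Whether a given branch is affected can be checked from its address and length in a disassembly. The Python sketch below assumes that "touches" means the instruction either crosses a 32-byte boundary or ends exactly on one; the function name `touches_32b_boundary` is hypothetical, chosen for this example.

```python
def touches_32b_boundary(start: int, length: int) -> bool:
    """True if the bytes [start, start + length) cross a 32-byte boundary
    or end exactly on one.

    Assumption for this sketch: that is what "touches" means here.
    For a macro-fused compare+branch, pass the address of the compare
    and the combined length of both instructions.
    """
    end = start + length  # address of the first byte after the instruction(s)
    return start // 32 != end // 32


if __name__ == "__main__":
    # 3-byte cmp at 0x18 followed by a 2-byte jb, treated as one fused unit
    print(touches_32b_boundary(0x18, 5))
    # a 2-byte jcc at 0x1e would end exactly on the 0x20 boundary
    print(touches_32b_boundary(0x1E, 2))
    # a 2-byte jcc at 0x1f would straddle the 0x20 boundary
    print(touches_32b_boundary(0x1F, 2))
```

The first call reports no problem because the fused pair ends at 0x1d, well inside the first 32-byte block; the other two report a hit.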

Simple alignment example:

I fixed the bugs in your source (missing : after the labels, and the performance bug of using 64-bit operand-size to xor-zero RCX). Although in this case you might want to keep xor rcx,rcx just to make the code longer, since you know some NOP bytes will be needed anyway. A REX prefix with W=0 would be even better, and wouldn't hurt performance on Silvermont, though.

And I filled in the placeholder with a SIMD load.

.intel_syntax noprefix
.p2align 4                  # align the top of the function
string_fun:
   cmp cl, byte ptr [rdi]
   jz .Lend
   xor ecx, ecx             # zeroing ECX implicitly zero-extends into RCX, saving a REX prefix
   lea rsi, [rdi + 1024]    # end pointer

# .p2align 4                # emit padding until a 2^4 boundary
.LmanipulationLoop:           # do {
   movdqu  xmm0, [rdi]
      # Do something like pcmpeqb / pmovmskb with the string bytes ...
   add    rdi, 16
   cmp    rdi, rsi
   jb    .LmanipulationLoop   # }while(p < endp);

.Lend:
  ret

Assemble with gcc -Wa,--keep-locals -c foo.S or as --keep-locals foo.s.
--keep-locals makes .L labels visible in the symbol table of the object file.

Then disassemble with objdump -drwC -Mintel foo.o:

0000000000000000 <string_fun>:
   0:   3a 0f                   cmp    cl,BYTE PTR [rdi]
   2:   74 16                   je     1a <.Lend>
   4:   31 c9                   xor    ecx,ecx
   6:   48 8d b7 00 04 00 00    lea    rsi,[rdi+0x400]
     # note address of this label, 
     # or without --keep-locals, of the instruction that you know is the loop top
000000000000000d <.LmanipulationLoop>:
   d:   f3 0f 6f 07             movdqu xmm0,XMMWORD PTR [rdi]
  11:   48 83 c7 10             add    rdi,0x10
  15:   48 39 f7                cmp    rdi,rsi
  18:   72 f3                   jb     d <.LmanipulationLoop>       # note the jump target address

000000000000001a <.Lend>:
  1a:   c3                      ret

Or with the `.p2align 4` uncommented, the assembler emits a 3-byte NOP:

0000000000000000 <string_fun>:
   0:   3a 0f                   cmp    cl,BYTE PTR [rdi]
   2:   74 19                   je     1d <.Lend>
   4:   31 c9                   xor    ecx,ecx
   6:   48 8d b7 00 04 00 00    lea    rsi,[rdi+0x400]
   d:   0f 1f 00                nop    DWORD PTR [rax]         # This is new, note that it's *before* the jump target

0000000000000010 <.LmanipulationLoop>:
  10:   f3 0f 6f 07             movdqu xmm0,XMMWORD PTR [rdi]
  14:   48 83 c7 10             add    rdi,0x10
  18:   48 39 f7                cmp    rdi,rsi
  1b:   72 f3                   jb     10 <.LmanipulationLoop>

000000000000001d <.Lend>:
  1d:   c3                      ret

Disassembling .o object files won't show sane addresses for calls to external functions; it's not linked yet so the rel32 displacements aren't filled in. But -r will show relocation info. And jumps within the source file do get fully resolved at assemble time.
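If you want to automate the "note the address of this label" step, the label header lines in the objdump output are easy to parse. The following Python sketch is an assumption-laden illustration, not a standard tool: the function name `label_alignment` is invented for this example, and it only recognizes header lines of the form shown in the listings above (a hex address followed by the symbol name in angle brackets and a colon).

```python
import re

# Matches objdump symbol header lines such as
# "000000000000000d <.LmanipulationLoop>:"
LABEL_RE = re.compile(r"^([0-9a-fA-F]+) <([^>]+)>:$")


def label_alignment(disassembly: str, alignment: int = 16) -> dict:
    """Map each label in `objdump -d` text to (address, address % alignment).

    Hypothetical helper for illustration; instruction lines ("   d:  f3 0f ...")
    do not match the pattern and are skipped.
    """
    result = {}
    for line in disassembly.splitlines():
        m = LABEL_RE.match(line.strip())
        if m:
            addr = int(m.group(1), 16)
            result[m.group(2)] = (addr, addr % alignment)
    return result


if __name__ == "__main__":
    sample = (
        "0000000000000000 <string_fun>:\n"
        "   0:   3a 0f                   cmp    cl,BYTE PTR [rdi]\n"
        "000000000000000d <.LmanipulationLoop>:\n"
        "   d:   f3 0f 6f 07             movdqu xmm0,XMMWORD PTR [rdi]\n"
    )
    for name, (addr, rem) in label_alignment(sample).items():
        status = "aligned" if rem == 0 else f"misaligned by {rem}"
        print(f"{name}: {addr:#x} {status}")
```

A remainder of 0 means the label sits on the requested boundary; pass `alignment=32` to check against 32-byte boundaries instead. Without --keep-locals the .L labels will not appear, so only the function symbol would be reported.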

Footnote 1: Note that NASM has a bad default and you need something like this to get long NOPs instead of multiple single-byte NOPs:

%use smartalign
alignmode p6, 64
