How can I reconcile short conditional jumps with branch target alignments with `.align` in Delphi assembler?

Question

I’m using Delphi 10.2 Tokyo, targeting both 32-bit and 64-bit, to write some functions entirely in assembly.

If I don’t use .align, the compiler correctly encodes conditional jumps in the short form: a 2-byte instruction consisting of a 1-byte opcode (e.g. 074h for je) and a 1-byte signed relative offset (-080h to +07Fh). But as soon as I add even a single .align, even one as small as .align 4, every conditional jump that is located before the .align and has its destination after the .align becomes a 6-byte instruction instead of the 2-byte one it should be. Only the instructions that are located after the .align remain correctly encoded as 2-byte short.

The Delphi assembler doesn’t accept a ‘short’ prefix.
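To make the two encodings concrete, here is a minimal Python sketch that computes the target of a conditional jump from its address and bytes. The helper name `jcc_target` is made up for illustration; the only facts it relies on are that the displacement is signed, little-endian, and relative to the end of the instruction. The sample values are taken from the listings in this question.

```python
def jcc_target(address: int, encoding: bytes) -> int:
    """Return the branch target of a Jcc rel8 or Jcc rel32 instruction.

    Hypothetical helper for illustration only.
    """
    if len(encoding) == 2 and 0x70 <= encoding[0] <= 0x7F:
        # Short form: opcode 70h..7Fh followed by a signed 8-bit displacement.
        disp = int.from_bytes(encoding[1:], "little", signed=True)
    elif len(encoding) == 6 and encoding[0] == 0x0F and 0x80 <= encoding[1] <= 0x8F:
        # Long form: 0Fh 80h..8Fh followed by a signed 32-bit displacement.
        disp = int.from_bytes(encoding[2:], "little", signed=True)
    else:
        raise ValueError("not a Jcc rel8/rel32 encoding")
    # The displacement is counted from the end of the jump instruction.
    return address + len(encoding) + disp


# 2-byte forward jump, 6-byte forward jump, 2-byte backward jump.
for addr, code in [(0x0041C358, "740C"),
                   (0x0041C35A, "0F841C000000"),
                   (0x0041C36A, "74FA")]:
    print(f"{addr:08X} {code:<12} -> {jcc_target(addr, bytes.fromhex(code)):08X}")
```

The 6-byte form spends four extra bytes on a displacement that would have fit in one.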

How can I reconcile short conditional jumps with branch target alignments with .align in Delphi assembler?

Here is a sample procedure; note the .align in the middle.

    procedure Test; assembler;
    label
      label1, label2, label3;
    asm
      mov     al, 1
      cmp     al, 2
      je      label1
      je      label2
      je      label3
    label1:
      mov     al, 3
      cmp     al, 4
      je      label1
      je      label2
      je      label3
      mov     al, 5
      .align 4
    label2:
      cmp     al, 6
      je      label1
      je      label2
      je      label3
      mov     al, 7
      cmp     al, 8
      je      label1
      je      label2
      je      label3
    label3:
    end;

Here is how it is encoded. The conditional jumps located before the .align that point to label2 and label3 (after the .align) are encoded as 6-byte instructions (this is a 64-bit CPU target):

    0041C354 B001          mov al,$01      //   mov     al, 1
    0041C356 3C02          cmp al,$02      //   cmp     al, 2
    0041C358 740C          jz $0041c366    //   je      label1
    0041C35A 0F841C000000  jz $0041c37c    //   je      label2
    0041C360 0F8426000000  jz $0041c38c    //   je      label3
    0041C366 B003          mov al,$03 //label1: mov al, 3
    0041C368 3C04          cmp al,$04      //   cmp     al, 4
    0041C36A 74FA          jz $0041c366    //   je      label1
    0041C36C 0F840A000000  jz $0041c37c    //   je      label2
    0041C372 0F8414000000  jz $0041c38c    //   je      label3
    0041C378 B005          mov al,$05      //   mov     al, 5
    0041C37A 8BC0          mov eax,eax     //  <-- a 2-byte dummy instruction, inserted by ".align 4" (almost a 2-byte NOP)
    0041C37C 3C06          cmp al,$06 //label2: cmp al, 6
    0041C37E 74E6          jz $0041c366    //   je      label1
    0041C380 74FA          jz $0041c37c    //   je      label2
    0041C382 7408          jz $0041c38c    //   je      label3
    0041C384 B007          mov al,$07      //   mov     al, 7
    0041C386 3C08          cmp al,$08      //   cmp     al, 8
    0041C388 74DC          jz $0041c366    //   je      label1
    0041C38A 74F0          jz $0041c37c    //   je      label2
    0041C38C C3            ret        // label3:

But if I remove the .align, all the instructions have the correct size, just 2 bytes each as before:

    0041C354 B001          mov al,$01      //   mov     al, 1
    0041C356 3C02          cmp al,$02      //   cmp     al, 2
    0041C358 7404          jz $0041c35e    //   je      label1
    0041C35A 740E          jz $0041c36a    //   je      label2
    0041C35C 741C          jz $0041c37a    //   je      label3
    0041C35E B003          mov al,$03 //label1: mov     al, 3
    0041C360 3C04          cmp al,$04      //   cmp     al, 4
    0041C362 74FA          jz $0041c35e    //   je      label1
    0041C364 7404          jz $0041c36a    //   je      label2
    0041C366 7412          jz $0041c37a    //   je      label3
    0041C368 B005          mov al,$05      //   mov     al, 5
    0041C36A 3C06          cmp al,$06 //.align 4 label2:cmp al, 6
    0041C36C 74F0          jz $0041c35e    //   je      label1
    0041C36E 74FA          jz $0041c36a    //   je      label2
    0041C370 7408          jz $0041c37a    //   je      label3
    0041C372 B007          mov al,$07      //   mov     al, 7
    0041C374 3C08          cmp al,$08      //   cmp     al, 8
    0041C376 74E6          jz $0041c35e    //   je      label1
    0041C378 74F0          jz $0041c36a    //   je      label2
    0041C37A C3            ret             //   je      label3
                                    //  label3:

Back to the conditional jump instructions: how can I reconcile short conditional jumps with branch target alignments with .align in Delphi assembler?

I acknowledge that the benefit of aligning branch targets on processors like Skylake and later is slim, and I understand that I can simply refrain from using .align, which would also save code size. But I want to know how I can make the Delphi assembler generate short jumps together with .align. The problem occurs with the 32-bit target too, not only the 64-bit one.

Inconsiderate usage of decor..., which amount to about 900kb - think of mobile. Perhaps you can substitute them with some charts, with a limited number of colors they should come up small. Or with some "lorem ipsum" if it's that you feel that your post is too short for your taste. — Sertac Akyuz
Terminology: Intel's manual calls jcc rel8 a "short" jump, and jcc rel32 a "near" jump. Both of them are near jumps, as opposed to a far jump to a different code segment. So "short" means "near with compact encoding". The online HTML versions get messy after the first page of the table :( — Peter Cordes
@PeterCordes It is a multi-pass assembler since it correctly puts short conditional jumps and long conditional jumps, where appropriate. Probably, there is either a bug or they think that if a programmer have ever used .align - size is no longer an issue and they use large versions. Delphi is a mega-ultra-fast compiler and they might have sacrificed code quality to keep quick compilation speed. Maybe they might not have thought that short jumps are a part of the branch prediction mechanism. — Maxim Masiutin
I do use .align with that assembler. That you get a few long forward branches shouldn't matter a lot. Most branches that matter (e.g. in loops) are backward anyway, and there it works. — Rudy Velthuis
@MaximMasiutin: Yes, both kinds of near jumps (rel8 and rel32) are common in real programs, and prediction works for them. (And rel16 in 16-bit code). I don't know how far jmp executes; it's irrelevant for performance because they're basically never used. (Except on WOW64, apparently, where 32-bit DLLs call into 64-bit code instead of having the kernel support an alternate 32-bit sysenter ABI like Linux does.) I'd guess that far jumps aren't predicted, but it's also possible that the CPU optimistically assumes that there's no call-gate or whatever. — Peter Cordes

Peter Cordes · Accepted Answer · 2017-07-15T19:47:59

Unless your assembler has an option to do better branch-displacement optimization (which might take repeated passes), you're probably out of luck. (Of course you could manually do all the alignment yourself, but that has to be re-done every time you change anything.)
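Doing the alignment by hand comes down to working out how many filler bytes each target needs. A small Python sketch of that arithmetic follows; the function names are made up for illustration, and the addresses are taken from the question's listing without .align. It assumes the start address of the procedure does not move, which is exactly what breaks whenever the code is edited.

```python
def padding_needed(address: int, alignment: int) -> int:
    """Number of filler bytes needed so that `address` becomes a multiple of `alignment`."""
    return -address % alignment


def fits_rel8(jump_address: int, target: int) -> bool:
    """True if a 2-byte jump at `jump_address` can reach `target`."""
    displacement = target - (jump_address + 2)
    return -128 <= displacement <= 127


# In the listing without .align, label2 sits at 0041C36A.
label2 = 0x0041C36A
pad = padding_needed(label2, 4)
print(pad, hex(label2 + pad))

# The first "je label2" is at 0041C35A; it still reaches the padded target
# with a 1-byte displacement.
print(fits_rel8(0x0041C35A, label2 + pad))
```

With the filler emitted by hand (for example as NOPs) instead of through .align, the assembler sees no alignment directive between the jump and its target.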

Or you could assemble the code with a different assembler. But as I expected, that's highly undesirable because you lose access to Delphi-specific stuff like object layout for things declared outside of the asm. (Thanks @Rudy for the comment.)

It's possible that you could write some of your function in Delphi assembler and do as much as possible of the Delphi-specific stuff there. Write the critical loop part in another assembler (NASM, for example), then hexdump its machine-code output into a db pseudo-instruction that you put in the middle of your Delphi assembly.

This could work ok if the start of every function is at least as aligned as anything inside a function, but you'd probably end up wasting instructions on putting constants into registers for use by the NASM part, which would probably be worse than just having longer branches.

"Only the instructions that are located after the .align remain correctly encoded as 2-byte short"

That isn't quite accurate. The first je label1 looks ok, and it's before the .align.

It looks like any branch that goes forward across a not-yet-evaluated .align directive leaves room for a rel32, and the assembler never comes back and fixes it. Every other case seems fine: backward branches across a .align, and forward branches that don't cross a .align.
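That guess can be checked against the listing with a small Python model. This is only a sketch of the rule as inferred from the output, not a description of how the Delphi assembler actually works internally. The instruction stream follows the disassembly in the question (which shows only two jumps after the last cmp), every non-jump instruction is 2 bytes, and the model assumes every branch that does not cross the .align is close enough for rel8.

```python
BASE = 0x0041C354  # start address from the question's listing

# ("op", size) | ("je", label) | ("label", name) | ("align", boundary)
PROGRAM = [
    ("op", 2), ("op", 2),
    ("je", "label1"), ("je", "label2"), ("je", "label3"),
    ("label", "label1"),
    ("op", 2), ("op", 2),
    ("je", "label1"), ("je", "label2"), ("je", "label3"),
    ("op", 2),
    ("align", 4),
    ("label", "label2"),
    ("op", 2),
    ("je", "label1"), ("je", "label2"), ("je", "label3"),
    ("op", 2), ("op", 2),
    ("je", "label1"), ("je", "label2"),  # as in the disassembly
    ("label", "label3"),
]


def crosses_align_forward(program, index):
    """True if the branch at `index` jumps forward over an align directive."""
    target = program.index(("label", program[index][1]))
    if target < index:
        return False  # backward branch
    return any(kind == "align" for kind, _ in program[index:target])


def layout(program, base):
    """Assign addresses using the inferred rule; return (labels, end address)."""
    address = base
    labels = {}
    for i, (kind, arg) in enumerate(program):
        if kind == "label":
            labels[arg] = address
        elif kind == "align":
            address += -address % arg          # filler bytes
        elif kind == "je":
            address += 6 if crosses_align_forward(program, i) else 2
        else:
            address += arg
    return labels, address


labels, end = layout(PROGRAM, BASE)
for name, addr in labels.items():
    print(f"{name}: {addr:08X}")
```

The label addresses this produces match the listing with .align (0041C366, 0041C37C, 0041C38C), which is consistent with the rule described above.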

Branch-displacement optimization is not an easy problem, especially when there are .align directives. This appears to be a really sub-optimal implementation, though.

Related: "Why is the "start small" algorithm for branch displacement not optimal?" has more about the algorithms assemblers use for branch-displacement optimization. Even good assemblers probably don't make optimal choices, especially when there are .align directives.
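For illustration, here is a minimal Python sketch of a "start small" pass: assume every branch is short, lay the code out, grow any branch whose target is out of rel8 range, and repeat until nothing changes. The program representation and the name `relax` are made up for this sketch, and it is not claimed to be what any particular assembler does. Branches are only ever grown, never shrunk, so the loop terminates, but the result is not guaranteed to be the smallest possible.

```python
def relax(program, base):
    """Return (labels, indices of long branches, end address)."""
    long_branches = set()
    while True:
        # Layout pass with the current size choices.
        address = base
        labels = {}
        starts = {}
        for i, (kind, arg) in enumerate(program):
            if kind == "label":
                labels[arg] = address
            elif kind == "align":
                address += -address % arg
            elif kind == "je":
                starts[i] = address
                address += 6 if i in long_branches else 2
            else:  # ("op", size)
                address += arg
        # Check pass: grow every short branch that cannot reach its target.
        grown = False
        for i, start in starts.items():
            if i in long_branches:
                continue
            displacement = labels[program[i][1]] - (start + 2)
            if not -128 <= displacement <= 127:
                long_branches.add(i)
                grown = True
        if not grown:
            return labels, long_branches, address


# Same shape as the question's sample (instruction stream as disassembled).
SAMPLE = (
    [("op", 2), ("op", 2), ("je", "label1"), ("je", "label2"), ("je", "label3"),
     ("label", "label1"),
     ("op", 2), ("op", 2), ("je", "label1"), ("je", "label2"), ("je", "label3"),
     ("op", 2), ("align", 4), ("label", "label2"),
     ("op", 2), ("je", "label1"), ("je", "label2"), ("je", "label3"),
     ("op", 2), ("op", 2), ("je", "label1"), ("je", "label2"),
     ("label", "label3")]
)

labels, long_branches, end = relax(SAMPLE, 0x0041C354)
print({name: f"{addr:08X}" for name, addr in labels.items()}, sorted(long_branches))
```

On the question's sample this keeps every branch in the 2-byte form while still placing label2 on a 4-byte boundary.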
