Performance optimisations of x86-64 assembly - Alignment and branch prediction

Question

I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc, using x86-64 assembly with SSE-2 instructions.

So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.

For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps completely degrades the overall performances. And there’s absolutely no reason in terms of code.

So my guess is that there is some issues with code alignment, and/or with branches which get mispredicted.

I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.

But is there some general advices, when developing for high performances on x86-64, about code alignment and branch prediction?

In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?

_func:
    ; ... Some code ...
    test rax, rax
    jz   .label
    ; ... Some code ...
    ret
    .label:
        ; ... Some code ...
        ret

In the previous code, should I use an align directive before .label:, like:

align 4
.label:

If so, is it enough to align on a DWORD when using SSE-2?

And about branch prediction, is there a «preffered» way to organize the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?

EDIT

Ok, here's a concrete example - here's the start of strlen() with SSE-2:

_strlen64_sse2:
    mov         rsi,    rdi
    and         rdi,    -16
    pxor        xmm0,   xmm0
    pcmpeqb     xmm0,   [ rdi ]
    pmovmskb    rdx,    xmm0
    ; ...

Running it 10'000'000 times with a 1000 character string gives about 0.48 seconds, which is fine.
But it does not check for a NULL string input. So obviously, I'll add a simple check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    ; ...

Same test, it runs now in 0.59 seconds. But if I align the code after this check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    align      8
    ; ...

The original performances are back. I used 8 for alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advices about when to align, or not to align code sections?

EDIT 2

Of course, it's not as simple as aligning every branch target. If I do it, performances will usually get worse, unless some specific cases like above.

@KerrekSB Thanks for the comment. Are those instructions still used by modern CPUs, or are they simply ignored? I can't find anything about them in Intel's optimisation manual for x86-64... — Macmade
As far as branch-prediction on modern x86 CPUs is concerned, checkout section 3 of this manual. — TheCodeArtist
I wonder how useful this level of optimization will be in a more realistic setting where the entire string doesn't live in L1 cache, which it clearly does for the benchmark you're using. The 20% performance differences you're worried about could be totally insignificant compared to memory fetch costs. — Gene

TheCodeArtist TheCodeArtist · Accepted Answer · 2013-08-16T18:10:37

Alignment optimisations

1. Use `.p2align <abs-expr> <abs-expr> <abs-expr>` instead of `align`.

Grants fine-grained control using its 3 params

param1 - Align to what boundary.
param2 - Fill padding with what (zeroes or NOPs).
param3 - Do NOT align if padding would exceed specified number of bytes.

2. Align the start of a frequently used code blocks to cache-line size boundaries.

This increases the chances that the entire code-block lies in a single cache-line. Once loaded into the L1-cache, then can run entirely without the need to access RAM for instruction fetch. This is highly beneficial for loops with a large number of iterations.

3. Use multi-byte `NOP`s for padding to reduce the time spent executing `NOP`s.

  /* nop */
  static const char nop_1[] = { 0x90 };

  /* xchg %ax,%ax */
  static const char nop_2[] = { 0x66, 0x90 };

  /* nopl (%[re]ax) */
  static const char nop_3[] = { 0x0f, 0x1f, 0x00 };

  /* nopl 0(%[re]ax) */
  static const char nop_4[] = { 0x0f, 0x1f, 0x40, 0x00 };

  /* nopl 0(%[re]ax,%[re]ax,1) */
  static const char nop_5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopw 0(%[re]ax,%[re]ax,1) */
  static const char nop_6[] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopl 0L(%[re]ax) */
  static const char nop_7[] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 };

  /* nopl 0L(%[re]ax,%[re]ax,1) */
  static const char nop_8[] =
    { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};

  /* nopw 0L(%[re]ax,%[re]ax,1) */
  static const char nop_9[] =
    { 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

  /* nopw %cs:0L(%[re]ax,%[re]ax,1) */
  static const char nop_10[] =
    { 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

(upto 10byte NOPs for x86. Source binutils-2.2.3.)

Branch-prediction optimisations

_{Lot of variations between x86_64 micro-architectures/generations. However a common set of guidelines that are applicable for all of them can be summarised as follows. Reference : Section 3 of Agner Fog's x86 micro-architecture manual.}

1. Un-roll loops to avoid slightly-too-high iteration counts.

Loop detection logic is guaranteed to work ONLY for loops with < 64 iterations. This is due to the fact that a branch instruction is recognized as having loop behavior if it goes one way n-1 times and then goes the other way 1 time, for any n upto 64.

This doesn't really apply for the predictors in Haswell and later which use a TAGE predictor and doesn't have dedicated loop-detection logic for specific branches. Iteration counts of ~23 can be the worst-case for an inner loop inside a tight outer loop with no other branching, on Skylake: the exit from the inner loop mispredicts most times, but the trip count is so low that it happens often. Unrolling can help by shortening the pattern, but for very high loop trip counts the single mispredict at the end is amortized over a lot of trips and it would take an unreasonable amount of unrolling to do anything about it.

2. Stick to near/short jumps.

Far jumps are not predicted i.e. pipeline always stalls on a far jump to a new code segment (CS:RIP). There's basically never a reason to use a far jump anyway so this is mostly not relevant.

Indirect jumps with an arbitrary 64-bit absolute address are predicted normally on most CPUs.

But Silvermont (Intel's low-power CPUs) have some limitations in predicting indirect jumps when the target is more than 4GB away, so avoiding that by loading/mapping executables and shared libraries in the low 32 bits of virtual address space can be a win there. e.g. on GNU/Linux by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. See Intel's optimization manual for more.