29
votes

I’m currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc, using x86-64 assembly with SSE-2 instructions.

So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.

For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps completely degrades the overall performances. And there’s absolutely no reason in terms of code.

So my guess is that there is some issues with code alignment, and/or with branches which get mispredicted.

I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.

But is there some general advices, when developing for high performances on x86-64, about code alignment and branch prediction?

In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?

_func:
    ; ... Some code ...
    test rax, rax
    jz   .label
    ; ... Some code ...
    ret
    .label:
        ; ... Some code ...
        ret

In the previous code, should I use an align directive before .label:, like:

align 4
.label:

If so, is it enough to align on a DWORD when using SSE-2?

And about branch prediction, is there a «preffered» way to organize the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?

EDIT

Ok, here's a concrete example - here's the start of strlen() with SSE-2:

_strlen64_sse2:
    mov         rsi,    rdi
    and         rdi,    -16
    pxor        xmm0,   xmm0
    pcmpeqb     xmm0,   [ rdi ]
    pmovmskb    rdx,    xmm0
    ; ...

Running it 10'000'000 times with a 1000 character string gives about 0.48 seconds, which is fine.
But it does not check for a NULL string input. So obviously, I'll add a simple check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    ; ...

Same test, it runs now in 0.59 seconds. But if I align the code after this check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    align      8
    ; ...

The original performances are back. I used 8 for alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advices about when to align, or not to align code sections?

EDIT 2

Of course, it's not as simple as aligning every branch target. If I do it, performances will usually get worse, unless some specific cases like above.

4
SSE2 has branch hint prefixes (2E and 3E).Kerrek SB
@KerrekSB Thanks for the comment. Are those instructions still used by modern CPUs, or are they simply ignored? I can't find anything about them in Intel's optimisation manual for x86-64...Macmade
Branch hints are ignored by all processors except P4.harold
As far as branch-prediction on modern x86 CPUs is concerned, checkout section 3 of this manual.TheCodeArtist
I wonder how useful this level of optimization will be in a more realistic setting where the entire string doesn't live in L1 cache, which it clearly does for the benchmark you're using. The 20% performance differences you're worried about could be totally insignificant compared to memory fetch costs.Gene

4 Answers

27
votes

Alignment optimisations

1. Use .p2align <abs-expr> <abs-expr> <abs-expr> instead of align.

Grants fine-grained control using its 3 params

  • param1 - Align to what boundary.
  • param2 - Fill padding with what (zeroes or NOPs).
  • param3 - Do NOT align if padding would exceed specified number of bytes.

2. Align the start of a frequently used code blocks to cache-line size boundaries.

  • This increases the chances that the entire code-block lies in a single cache-line. Once loaded into the L1-cache, then can run entirely without the need to access RAM for instruction fetch. This is highly beneficial for loops with a large number of iterations.

3. Use multi-byte NOPs for padding to reduce the time spent executing NOPs.

  /* nop */
  static const char nop_1[] = { 0x90 };

  /* xchg %ax,%ax */
  static const char nop_2[] = { 0x66, 0x90 };

  /* nopl (%[re]ax) */
  static const char nop_3[] = { 0x0f, 0x1f, 0x00 };

  /* nopl 0(%[re]ax) */
  static const char nop_4[] = { 0x0f, 0x1f, 0x40, 0x00 };

  /* nopl 0(%[re]ax,%[re]ax,1) */
  static const char nop_5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopw 0(%[re]ax,%[re]ax,1) */
  static const char nop_6[] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopl 0L(%[re]ax) */
  static const char nop_7[] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 };

  /* nopl 0L(%[re]ax,%[re]ax,1) */
  static const char nop_8[] =
    { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};

  /* nopw 0L(%[re]ax,%[re]ax,1) */
  static const char nop_9[] =
    { 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

  /* nopw %cs:0L(%[re]ax,%[re]ax,1) */
  static const char nop_10[] =
    { 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

(upto 10byte NOPs for x86. Source binutils-2.2.3.)


Branch-prediction optimisations

Lot of variations between x86_64 micro-architectures/generations. However a common set of guidelines that are applicable for all of them can be summarised as follows. Reference : Section 3 of Agner Fog's x86 micro-architecture manual.

1. Un-roll loops to avoid slightly-too-high iteration counts.

  • Loop detection logic is guaranteed to work ONLY for loops with < 64 iterations. This is due to the fact that a branch instruction is recognized as having loop behavior if it goes one way n-1 times and then goes the other way 1 time, for any n upto 64.

    This doesn't really apply for the predictors in Haswell and later which use a TAGE predictor and doesn't have dedicated loop-detection logic for specific branches. Iteration counts of ~23 can be the worst-case for an inner loop inside a tight outer loop with no other branching, on Skylake: the exit from the inner loop mispredicts most times, but the trip count is so low that it happens often. Unrolling can help by shortening the pattern, but for very high loop trip counts the single mispredict at the end is amortized over a lot of trips and it would take an unreasonable amount of unrolling to do anything about it.

2. Stick to near/short jumps.

  • Far jumps are not predicted i.e. pipeline always stalls on a far jump to a new code segment (CS:RIP). There's basically never a reason to use a far jump anyway so this is mostly not relevant.

    Indirect jumps with an arbitrary 64-bit absolute address are predicted normally on most CPUs.

    But Silvermont (Intel's low-power CPUs) have some limitations in predicting indirect jumps when the target is more than 4GB away, so avoiding that by loading/mapping executables and shared libraries in the low 32 bits of virtual address space can be a win there. e.g. on GNU/Linux by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. See Intel's optimization manual for more.

24
votes

To extends on TheCodeArtist's answer, who made some good points, here are a few additional stuff and details, as I was actually able to solve the problem.

1 - Code alignment

Intel recommends aligning code and branch targets on 16-byte boundaries:

3.4.1.5 - Assembly/Compiler Coding Rule 12. (M impact, H generality)
All branch targets should be 16-byte aligned.

While this is usually a good advice, it should be done carefully.
Blindly 16-byte aligning everything may lead to performance lost, so this should be tested on each branch target before applying.

As TheCodeArtist pointed it out, using multi-byte NOPs may help here, as simply using standard one-byte NOPs may not bring the expected performance gain of code alignment.

As a sidenote, the .p2align directive is not available in NASM or YASM.
But they do support alignment with other instructions than NOPs with the standard align directive:

align 16, xor rax, rax

2 . Branch prediction

This turned out to be the most important part.
While it's right that every generation of x86-64 CPUs have different branch prediction algorithms, some simple rules can be applied generally to help the CPU predicting which branch will likely be taken.

The CPU tries to keep a branching history in the BTB (Branch Target Buffer).
But when branch information is not available in the BTB, the CPU will use what they call static prediction, which obey to simple rules, as mentioned in Intel's manuals:

  1. Predict forward conditional branches to be not taken.
  2. Predict backward conditional branches to be taken.

Here's an example for the first case:

test rax, rax
jz   .label

; Fallthrough - Most likely

.label:

    ; Forward branch - Most unlikely

Instructions under .label is the unlikely condition, because .label is declared after the actual branch.

For the second case:

.label:

    ; Backward branch - Most likely

test rax, rax
jz   .label

; Fallthrough - Most unlikely

Here, the instructions under .label are the likely condition, as .label is declared before the actual branch.

So each conditional branch should always follow this simple pattern.
And of course, this is also suitable for loops.

As I mentioned before, this was the most important part.

I was experiencing unpredictable performance gains or losses while adding simple tests that should logically improve the overall performances.
Sticking blindly to these rules solved the issues.
If not, the addition of a branch for optimisation purpose may have the opposite result.

TheCodeArtist also mentions loop unrolling in his answer.
While this wasn't the issue, as my loops were already unrolled, I mention it here as it's indeed extremely important, and brings substantial performance gains.

And as a last note for the readers, while this may seem obvious and wasn't the problem here, don't branch when unnecessary.

Starting with the Pentium Pro, x86 processors have conditional move instructions, which may help to eliminate branching and suppress the risk of misprediction:

test   rax, rax
cmovz  rbx, rcx

So just in case, nice thing to keep in mind.

5
votes

To get a better understanding of why and how alignment matters, check out Agner Fog's the microarchitecture doc, esp. the section about the instruction-fetch front-end of various CPU designs. Sandybridge introduced the uop cache, which makes a huge different to throughput, esp. in SSE code where instruction length is often too long for 16B per cycle to cover 4 instructions.

The rules for filling uop cache lines are complicated, but a new block of 32B of instructions always starts a new cache line, IIRC. So aligning hot function entry points to 32B is a good idea. That much padding in other cases might be hurting I$ density more than helping. (L1 I$ still has 64B cache lines, though, so some things might hurt L1 I$ density while helping uop cache density.)

The loop buffer helps too, but taken branches disrupt the 4 uops per cycle. e.g. a loop of 3 uops executes like abc, abc, not abca, bcda. So a 5-uop loop goes at one iteration per 2 cycles, not one per 1.25. This makes unrolling even more valuable.

3
votes

The "branch targets should be 16 byte aligned rule" is not an absolute. The reason for the rule is that with 16 byte alignment, 16 bytes of instructions can be read in one cycle, and then another 16 bytes in the next cycle. If your target is at offset 16n + 2, then the processor can still read 14 bytes of instructions (the remainder of the cache line) in one cycle, and that is often good enough. Starting a loop at offset 16n + 15 however is a bad idea, since only one instruction byte can be read at a time. More useful is to keep the whole loop in the smallest number of cache lines possible.

On some processors branch prediction has the odd behaviour that all branches within 8 or 4 bytes use the same branch predictor. Move branches so that each conditional branch uses its own branch predictor.

What both of these have in common is that inserting some bits of code can change the behaviour and make it faster or slower.