12
votes

It seems that on a modern Pentium it is no longer possible to give branching hints to the processor. Assuming that a profiling compiler such as GCC with profile-guided optimization gains information about likely branching behavior, what can it do to produce code that will execute more quickly?

The only option I know of is to move unlikely branches to the end of a function. Is there anything else?

Update.

http://download.intel.com/products/processor/manual/325462.pdf volume 2a, section 2.1.1 says

"Branch hint prefixes (2EH, 3EH) allow a program to give a hint to the processor about the most likely code path for a branch. Use these prefixes only with conditional branch instructions (Jcc). Other use of branch hint prefixes and/or other undefined opcodes with Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable behavior."

I don't know whether these actually have any effect, however.

On the other hand, section 3.4.1 of http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf says

" Compilers generate code that improves the efficiency of branch prediction in Intel processors. The Intel C++ Compiler accomplishes this by:

  • keeping code and data on separate pages
  • using conditional move instructions to eliminate branches
  • generating code consistent with the static branch prediction algorithm
  • inlining where appropriate
  • unrolling if the number of iterations is predictable

With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see Intel C++ Compiler documentation."

http://cache-www.intel.com/cd/00/00/40/60/406096_406096.pdf says in "Performance Improvements with PGO":

" PGO works best for code with many frequently executed branches that are difficult to predict at compile time. An example is the code with intensive error-checking in which the error conditions are false most of the time. The infrequently executed (cold) errorhandling code can be relocated so the branch is rarely predicted incorrectly. Minimizing cold code interleaved into the frequently executed (hot) code improves instruction cache behavior."


4 Answers

7
votes

There are two possible sources for the information you want:

  1. There's Intel 64 and IA-32 Architectures Software Developer's Manual (3 volumes). This is a huge work which has evolved for decades. It's the best reference I know on a lot of subjects, including floating-point. In this case, you want to check volume 2, the instruction set reference.
  2. There's the Intel 64 and IA-32 Architectures Optimization Reference Manual. This will tell you, in somewhat brief terms, what to expect from each microarchitecture.

Now, I don't know what you mean by a "modern Pentium" processor; this is 2013, right? There aren't any Pentiums anymore...

The instruction set does support telling the processor whether the branch is expected to be taken or not via a prefix on the conditional branch instructions (such as JC, JZ, etc.). See volume 2A of (1), section 2.1.1, "Instruction Prefixes" (of the version I have). The 2E and 3E prefixes mean not taken and taken, respectively.
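Purely as an illustration (a sketch using GCC's asm goto; whether a given microarchitecture actually honors the byte is the question addressed just below), the prefix can be emitted by hand:

/* Sketch: hand-emitting the 3EH "taken" hint in front of a conditional jump.
   On CPUs that ignore the hint, the byte decodes as a harmless DS
   segment-override prefix. */
static int likely_nonzero(int x)
{
    __asm__ goto (
        "test %0, %0\n\t"
        ".byte 0x3e\n\t"        /* 3EH = taken hint; 2EH would mean not taken */
        "jnz %l[taken]"
        :                       /* asm goto takes no outputs */
        : "r"(x)
        : "cc"
        : taken);
    return 0;                   /* fall-through: branch not taken */
taken:
    return 1;
}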

As to whether these prefixes actually have any effect: if that information is available at all, it will be in the Optimization Reference Manual, in the section for the microarchitecture you care about (and I'm sure it won't be the Pentium).

Apart from using those, there is an entire section of the Optimization Reference Manual on this subject: section 3.4.1 (of the version I have).

It makes no sense to reproduce that here, since you can download the manual for free. Briefly:

  • Eliminate branches by using conditional instructions (CMOV, SETcc)
  • Consider the static prediction algorithm (section 3.4.1.3)
  • Inline where appropriate
  • Unroll loops

Also, some compilers (GCC, for instance), even when CMOV is not possible, often use bitwise arithmetic to select one of two computed values, thus avoiding branches. They do this particularly with SSE instructions when vectorizing loops.
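For instance (just a sketch; actual code generation depends on the compiler, the flags and the target), a pattern that typically compiles to a compare plus conditional move rather than a branch:

/* Sketch: a min/clamp pattern that GCC and other compilers typically turn
   into cmp + cmov at -O2, and into a compare + blend when vectorized. */
static int clamp_min(int x, int lo)
{
    return x < lo ? lo : x;
}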

Basically, the static prediction rules are:

  • Unconditional branches are predicted to be taken (...as you would expect...)
  • Indirect branches are predicted not to be taken (because of a data dependency)
  • Backward conditionals are predicted to be taken (good for loops)
  • Forward conditionals are predicted not to be taken

You probably want to read the entire section 3.4.1.
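To make those rules concrete, here is a small sketch of source laid out so that the static predictions match the common case (the assumption that negative elements are rare is purely hypothetical):

/* Sketch: the loop's backward conditional is taken on every iteration but the
   last, which the static predictor gets right, and the rare error case is a
   forward conditional, which is statically predicted not taken. */
static long sum_nonnegative(const int *v, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] < 0)           /* assumed rare: forward branch, predicted not taken */
            return -1;
        total += v[i];
    }
    return total;
}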

3
votes

If it's clear that a loop is rarely entered, or that it normally iterates very few times, then the compiler might avoid unrolling the loop, as doing so can add a lot of harmful complexity to handle edge conditions (an odd number of iterations, etc.). Vectorisation, in particular, should be avoided in such cases.

The compiler might rearrange nested tests so that the test which most frequently short-circuits the evaluation is performed first, avoiding a test on something with a roughly 50% pass rate.
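For example (hypothetical predicates, just to sketch the idea):

#include <stdbool.h>

/* Hypothetical predicates: profiling shows rarely_true holds on about 1% of
   calls, while coin_flip passes about half the time and predicts poorly.
   Testing the almost-always-false condition first lets && short-circuit, so
   the hard-to-predict test only runs on that 1% of calls. */
static bool needs_special_handling(bool rarely_true, bool coin_flip)
{
    return rarely_true && coin_flip;
}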

Register allocation can be optimised so that a rarely-used block doesn't force a register spill on the common path.

These are just some examples. I'm sure there are others I haven't thought of.

2
votes

Off the top of my head, you have two options.

Option #1: Inform the compiler of the hints and let the compiler organize the code appropriately. For example, GCC supports the following ...

__builtin_expect((long)!!(x), 1L)  /* GNU C to indicate that <x> will likely be TRUE */
__builtin_expect((long)!!(x), 0L)  /* GNU C to indicate that <x> will likely be FALSE */

If you put them in macro form such as ...

#if defined(__GNUC__)    /* GCC (and compatible compilers) provide __builtin_expect */
    #define LIKELY(x)    __builtin_expect((long)!!(x), 1L)
    #define UNLIKELY(x)  __builtin_expect((long)!!(x), 0L)
#else
    #define LIKELY(x)   (x)
    #define UNLIKELY(x) (x)
#endif

... you can now use them as ...

if (LIKELY (x != 0)) {
    /* DO SOMETHING */
} else {
    /* DO SOMETHING ELSE */
}

This leaves the compiler free to organize the branches according to static branch prediction algorithms, and/or if the processor and compiler support it, to use instructions that indicate which branch is more likely to be taken.

Option #2: Use math to avoid branching.

if (a < b)
    y = C;
else
    y = D;

This could be re-written as ...

x = -(a < b);   /* x = -1 if a < b, x = 0 if a >= b */
x &= (C - D);   /* x = C - D if a < b, x = 0 if a >= b */
x += D;         /* x = C if a < b, x = D if a >= b */
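The same trick can be wrapped in a small helper; this is only a sketch, and it assumes (like the snippet above) that the comparison yields 0 or 1 and that C - D does not overflow:

static int select_if_less(int a, int b, int C, int D)
{
    int mask = -(a < b);            /* all ones if a < b, otherwise 0 */
    return (mask & (C - D)) + D;    /* C if a < b, otherwise D */
}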

Hope this helps.

1
vote

It can make the fall-through (i.e. the case where the branch is not taken) the most frequently used path. That has two big effects:

  1. only one branch can be taken per clock cycle (or, on some processors, only one every two clocks), so if there are any other branches nearby (there usually are; most code that matters is in a loop), a taken branch is bad news and a not-taken branch much less so.
  2. when the branch predictor is wrong, the code that it does have to execute is more likely to be in the code cache (or µop cache, where applicable). If it wasn't, that would have been a double-whammy of restarting the pipeline and waiting for a cache miss. This is less of an issue in most loops, since both sides of the branch are likely to be in the cache, but it comes into play in big loops and other code.
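As a rough sketch of that layout decision (assuming GCC; __builtin_expect is a GCC extension), telling the compiler that the error is unlikely keeps the hot path as straight-line fall-through code and pushes the cold block out of the way:

static int step(int err)
{
    if (__builtin_expect(err != 0, 0)) {
        return -1;              /* marked unlikely: the compiler can place this
                                   block after the hot code */
    }
    return 0;                   /* hot path: not-taken fall-through */
}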

It can also decide whether to do if-conversion based on real data rather than a heuristic guess. If-conversion may seem like "always a good idea", but it isn't; it's only "often a good idea". If the branch in the branching implementation is very well predicted, the if-converted code can easily be slower.
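A sketch of the trade-off, purely for illustration (which form the compiler actually emits depends on the target and on its cost model):

/* For source like this ... */
static long pick(int cond, long a, long b)
{
    return cond ? a * 3 : b * 7;
}
/* ... the compiler can either
   (a) if-convert: compute both a * 3 and b * 7, then pick one with cmov,
       so there is no branch and no misprediction, but both arms always run; or
   (b) keep the branch: only the chosen arm is computed, which wins when the
       profile says the branch is predicted correctly almost every time. */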