Are there any performance test results for usage of likely/unlikely hints?


GCC offers likely/unlikely hints (built on __builtin_expect) that help the compiler generate machine code that is friendlier to branch prediction.

Is there any data on how proper usage or failure to use those hints affects performance of real code on some real systems?

c++ c optimization gcc micro-optimization

I don't think there would be strong metrics, as it is a micro-optimization, and it will depend on how often the hint is correct or not, the size of the binary code in the if/else blocks and maybe even the phase of the moon – David Rodríguez - dribeas

Essentially this maps to the CPU's branch predictor; the size of the binary code is irrelevant. – MSalters

I have no benchmarks for the performance, but I can say that the assembly produced by gcc with such hints is much clearer. – Jens Gustedt

@Jens Gustedt: AFAIK hints only lead to swapping branches. How does code get much cleaner? – sharptooth

Exactly by that. It is much easier to follow the main branch, since it is then contiguous, and the parts that are considered "unlikely" are moved behind it, out of view. – Jens Gustedt

3 Answers


The question differs, but Peter Cordes's answer on this question gives a clear hint ;). Modern CPUs ignore static hints and use dynamic branch prediction.


I don't know of any thorough analysis of these particular hints. In any case, it would be extremely CPU-specific. In general, if you are sure about the likelihood (e.g., > 90%), then it is probably worthwhile to add such annotations, although improvements will vary a lot with the specific use case.

Modern desktop CPUs tend to have very good branch prediction. If your code is on a hot path anyway, the dynamic branch predictor will quickly figure out on its own that the branch is biased. Such hints are mainly useful to help the static predictor, which kicks in if no dynamic branch information is available.
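Since results depend so much on the CPU and the use case, the most reliable data is a measurement of your own code. Below is a minimal sketch of such a comparison, not a rigorous benchmark. The function names (count_negative_plain, count_negative_hinted, time_variant) are hypothetical, and the unlikely macro is assumed to be defined via GCC's __builtin_expect. For a loop this simple the compiler may emit branchless or vectorized code, in which case the hint changes nothing.

```c
#include <stddef.h>
#include <time.h>

/* Assumed definition of the hint macro, based on the GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Baseline: count negative values with no hint. noinline keeps the
 * two variants as separate functions so they can be timed apart. */
__attribute__((noinline))
long count_negative_plain(const int *v, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0)
            count++;
    }
    return count;
}

/* Same logic, but the negative case is marked as unlikely. */
__attribute__((noinline))
long count_negative_hinted(const int *v, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))
            count++;
    }
    return count;
}

/* Run fn over the data reps times; return elapsed CPU seconds and
 * store the accumulated result so the calls are not optimized away. */
double time_variant(long (*fn)(const int *, size_t),
                    const int *v, size_t n, int reps, long *result)
{
    volatile long sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += fn(v, n);
    clock_t end = clock();
    *result = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To use it, fill an array with data whose bias matches your real workload, call time_variant once per variant, check that both results agree, and compare the timings over several runs. clock() is coarse, so the repetition count needs to be large enough for the differences to rise above the noise.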

On x86, the static predictor predicts forward branches not to be taken and backward branches to be taken (since they usually indicate loops). The compiler will therefore adjust static code layout to match the predictions. (This may also help putting the hot path on adjacent cache lines, which may help further.)
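For reference, the hints are usually written as a pair of macros over GCC's __builtin_expect, as in the sketch below. The function sum_valid and its parameters are hypothetical and only illustrate the pattern: the rare error case is marked unlikely, so the compiler is free to move it off the fall-through path.

```c
#include <stddef.h>

/* GCC/Clang builtin; !!(x) normalizes any truthy value to 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sum the non-negative samples and count the
 * (expected to be rare) negative ones as errors. */
long sum_valid(const int *samples, size_t n, size_t *errors)
{
    long total = 0;
    *errors = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(samples[i] < 0)) {
            (*errors)++;      /* cold path: hinted as rarely taken */
            continue;
        }
        total += samples[i];  /* hot path: expected common case */
    }
    return total;
}
```

Compiling with gcc -O2 -S and comparing the output with and without the macro shows whether the layout actually changed for a given target. The hint only affects code generation, never the result.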

On PPC, some jump instructions have a bit to predict their likelihood. I don't know if the compiler will rearrange code, too.

I don't know how ARM CPUs predict branches. As low-power devices, they may have less sophisticated branch prediction, and static prediction could have more impact.


Likely/unlikely hints work by preloading the ICache with the branch code that the programmer perceives as being generally correct. Branch predictors rely on limited historical data, so they are effective only in loops (or small codebases), and loops are not always the issue with regard to branching performance. Consider a real-time simulation or game, where large amounts of sim/game logic need to be processed for large numbers of objects at a very high rate. Branch predictors cannot operate effectively in this context, and this is a serious performance concern for sim developers. This logic can consist of literally thousands of different, non-repeating conditionals each frame, which leaves a branch predictor unable to operate effectively.

In answer to the original question, compilers tend to assume a conditional will be false when generating the code to preload the ICache. You should check the assembly output of your code to verify that, and then you might be able to write a macro for the conditionals you want to preload in a performant way, if you don't want to structure your code to fit a particular processor architecture.

Some studies have estimated that modern game engines, on modern processors, spend 60-80% of their time on cache misses, and that branch mispredictions are approximately 15% of those misses. In order to accommodate a modern game engine, a branch predictor would need historical data for the entire game logic frame, probably involving several MB of data for each pipeline.