Laziness
It's not a "compiler optimisation", but it's something guaranteed by the language specification, so you can always count on it happening. Essentially, this means that work is not performed until you "do something" with the result. (Unless you do one of several things to deliberately turn off laziness.)
This, obviously, is an entire topic in its own right, and SO has lots of questions and answers about it already.
In my limited experience, making your code too lazy or too strict has vastly larger performance penalties (in time and space) than any of the other stuff I'm about to talk about...
Strictness analysis
Laziness is about avoiding work unless it's necessary. If the compiler can determine that a given result will "always" be needed, then it won't bother storing the calculation and performing it later; it'll just perform it directly, because that is more efficient. This is so-called "strictness analysis".
The gotcha, obviously, is that the compiler cannot always detect when something could be made strict. Sometimes you need to give the compiler little hints. (I'm not aware of any easy way to determine whether strictness analysis has done what you think it has, other than wading through the Core output.)
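The classic place where a hint helps is a left fold with an accumulator. A sketch (my own names, using the BangPatterns extension):

{-# LANGUAGE BangPatterns #-}

mySum :: [Int] -> Int
mySum = go 0
  where
    -- The bang forces the accumulator at every step, so it stays a plain
    -- number instead of growing into a chain of unevaluated (+) thunks.
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs

With optimisation on, GHC's strictness analysis will often work this out for itself; the bang is for the cases where it doesn't.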
Inlining
If you call a function, and the compiler can tell which function you're calling, it may try to "inline" that function - that is, to replace the function call with a copy of the function itself. The overhead of a function call is usually pretty small, but inlining often enables other optimisations to happen which wouldn't have happened otherwise, so inlining can be a big win.
Functions are only inlined if they are "small enough" (or if you add a pragma specifically asking for inlining). Also, functions can only be inlined if the compiler can tell what function you're calling. There are two main ways that the compiler could be unable to tell:
If the function you're calling is passed in from somewhere else. E.g., when the filter function is compiled, you can't inline the filter predicate, because it's a user-supplied argument.
If the function you're calling is a class method and the compiler doesn't know what type is involved. E.g., when the sum function is compiled, the compiler can't inline the + function, because sum works with several different number types, each of which has a different + function.
In the latter case, you can use the {-# SPECIALIZE #-} pragma to generate versions of a function that are hard-coded to a particular type. E.g., {-# SPECIALIZE sum :: [Int] -> Int #-} would compile a version of sum hard-coded for the Int type, meaning that + can be inlined in this version.
Note, though, that our new special sum function will only be called when the compiler can tell that we're working with Int. Otherwise the original, polymorphic sum gets called. Again, the actual function call overhead is fairly small. It's the additional optimisations that inlining can enable which are beneficial.
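To make that concrete, here's roughly what the two pragmas look like in a source file. (This is my own sketch; the module and function names are made up.)

module MyMaths where

-- Ask GHC to inline this function even if it would otherwise judge it too big:
{-# INLINE clamp #-}
clamp :: Int -> Int -> Int -> Int
clamp lo hi x = max lo (min hi x)

-- Generate an Int-specific copy of this polymorphic function,
-- so that + and div can be inlined inside that copy:
{-# SPECIALIZE average :: [Int] -> Int #-}
average :: Integral a => [a] -> a
average xs = sum xs `div` fromIntegral (length xs)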
Common subexpression elimination
If a certain block of code calculates the same value twice, the compiler may replace that with a single instance of the same computation. For example, if you do
(sum xs + 1) / (sum xs + 2)
then the compiler might optimise this to
let s = sum xs in (s+1)/(s+2)
You might expect that the compiler would always do this. However, apparently in some situations this can result in worse performance, not better, so GHC does not always do this. Frankly, I don't really understand the details behind this one. But the bottom line is, if this transformation is important to you, it's not hard to do it manually. (And if it's not important, why are you worrying about it?)
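For what it's worth, the example usually given for CSE backfiring involves sharing: if the repeated expression produces a big lazy structure, keeping one shared copy alive can cost far more memory than just recomputing it. A sketch of the idea (my own, purely illustrative):

f, g :: Integer -> Integer
-- Recomputed: each [1..n] is produced lazily and garbage-collected as it
-- is consumed, so memory use stays flat.
f n = sum [1..n] + product [1..n]
-- "Optimised" by sharing: the whole list now has to be kept in memory
-- until both traversals have finished.
g n = let xs = [1..n] in sum xs + product xs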
Case expressions
Consider the following:
foo (0:_ ) = "zero"
foo (1:_ ) = "one"
foo (_:xs) = foo xs
foo ( []) = "end"
The first three equations all check whether the list is non-empty (among other things). But checking the same thing thrice is wasteful. Fortunately, it's very easy for the compiler to optimise this into several nested case expressions. In this case, something like
foo xs =
case xs of
y:ys ->
case y of
0 -> "zero"
1 -> "one"
_ -> foo ys
[] -> "end"
This is rather less intuitive, but more efficient. Because the compiler can easily do this transformation, you don't have to worry about it. Just write your pattern matching in the most intuitive way possible; the compiler is very good at reordering and rearranging this to make it as fast as possible.
Fusion
The standard Haskell idiom for list processing is to chain together functions that take one list and produce a new list. The canonical example is
map g . map f
Unfortunately, while laziness guarantees skipping unnecessary work, all the allocations and deallocations for the intermediate list sap performance. "Fusion" or "deforestation" is where the compiler tries to eliminate these intermediate steps.
The trouble is, most of these functions are recursive. Without the recursion, it would be an elementary exercise in inlining to squish all the functions into one big code block, run the simplifier over it and produce really optimal code with no intermediate lists. But because of the recursion, that won't work.
You can use {-# RULES #-} pragmas to fix some of this. For example,
{-# RULES "map/map" forall f g xs. map f (map g xs) = map (f.g) xs #-}
Now every time GHC sees map applied to map, it squishes it into a single pass over the list, eliminating the intermediate list.
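(If you want to see whether a rule like this actually fired, GHC can tell you at compile time; something like the following, where the file name is just a placeholder,

ghc -O2 -ddump-rule-firings MyModule.hs

will print the name of each rewrite rule as it fires.)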
Trouble is, this works only for map followed by map. There are many other possibilities - map followed by filter, filter followed by map, etc. Rather than hand-code a solution for each of them, so-called "stream fusion" was invented. This is a more complicated trick, which I won't describe here.
The long and short of it is: These are all special optimisation tricks written by the programmer. GHC itself knows nothing about fusion; it's all in the list library and other container libraries. So what optimisations happen depends on how your container libraries are written (or, more realistically, which libraries you choose to use).
For example, if you work with Haskell '98 arrays, don't expect any fusion of any kind. But I understand that the vector library has extensive fusion capabilities. It's all about the libraries; the compiler just provides the RULES pragma. (Which is extremely powerful, by the way. As a library author, you can use it to rewrite client code!)
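To give a flavour of what that buys you (a sketch assuming the vector package; with -O the library's fusion rules should collapse this pipeline so that no intermediate vectors are ever allocated):

import qualified Data.Vector.Unboxed as V

pipeline :: Int -> Int
pipeline n = V.sum (V.map (*2) (V.filter even (V.enumFromTo 1 n)))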
Meta:
I agree with the people saying "code first, profile second, optimise third".
I also agree with the people saying "it is useful to have a mental model for how much cost a given design decision has".
Balance in all things, and all that...