Laziness
It's not a "compiler optimisation", but it's something guaranteed by the language specification, so you can always count on it happening. Essentially, this means that work is not performed until you "do something" with the result. (Unless you do one of several things to deliberately turn off laziness.)
This, obviously, is an entire topic in its own right, and SO has lots of questions and answers about it already.
In my limited experience, making your code too lazy or too strict has vastly larger performance penalties (in time and space) than any of the other stuff I'm about to talk about...
Strictness analysis
Laziness is about avoiding work unless it's necessary. If the compiler can determine that a given result will "always" be needed, then it won't bother storing the calculation and performing it later; it'll just perform it directly, because that is more efficient. This is so-called "strictness analysis".
The gotcha, obviously, is that the compiler cannot always detect when something could be made strict. Sometimes you need to give the compiler little hints. (I'm not aware of any easy way to determine whether strictness analysis has done what you think it has, other than wading through the Core output.)
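The classic place where a hint helps is a left fold with an accumulator. A sketch (my own names, using the BangPatterns extension):

{-# LANGUAGE BangPatterns #-}

mySum :: [Int] -> Int
mySum = go 0
  where
    -- The bang forces the accumulator at every step, so it stays a plain
    -- number instead of growing into a chain of unevaluated (+) thunks.
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs

With optimisation on, GHC's strictness analysis will often work this out for itself; the bang is for the cases where it doesn't.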
Inlining
If you call a function, and the compiler can tell which function you're calling, it may try to "inline" that function - that is, to replace the function call with a copy of the function itself. The overhead of a function call is usually pretty small, but inlining often enables other optimisations to happen which wouldn't have happened otherwise, so inlining can be a big win.
Functions are only inlined if they are "small enough" (or if you add a pragma specifically asking for inlining). Also, functions can only be inlined if the compiler can tell what function you're calling. There are two main ways that the compiler could be unable to tell:
If the function you're calling is passed in from somewhere else. E.g., when the filter function is compiled, you can't inline the filter predicate, because it's a user-supplied argument.
If the function you're calling is a class method and the compiler doesn't know what type is involved. E.g., when the sum function is compiled, the compiler can't inline the + function, because sum works with several different number types, each of which has a different + function.
In the latter case, you can use the {-# SPECIALIZE #-} pragma to generate versions of a function that are hard-coded to a particular type. E.g., {-# SPECIALIZE sum :: [Int] -> Int #-} would compile a version of sum hard-coded for the Int type, meaning that + can be inlined in this version.
Note, though, that our new special sum function will only be called when the compiler can tell that we're working with Int. Otherwise the original, polymorphic sum gets called. Again, the actual function call overhead is fairly small. It's the additional optimisations that inlining can enable which are beneficial.
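To make that concrete, here's roughly what the two pragmas look like in a source file. (This is my own sketch; the module and function names are made up.)

module MyMaths where

-- Ask GHC to inline this function even if it would otherwise judge it too big:
{-# INLINE clamp #-}
clamp :: Int -> Int -> Int -> Int
clamp lo hi x = max lo (min hi x)

-- Generate an Int-specific copy of this polymorphic function,
-- so that + and div can be inlined inside that copy:
{-# SPECIALIZE average :: [Int] -> Int #-}
average :: Integral a => [a] -> a
average xs = sum xs `div` fromIntegral (length xs)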
Common subexpression elimination
If a certain block of code calculates the same value twice, the compiler may replace that with a single instance of the same computation. For example, if you do
(sum xs + 1) / (sum xs + 2)
then the compiler might optimise this to
let s = sum xs in (s+1)/(s+2)
You might expect that the compiler would always do this. However, apparently in some situations this can result in worse performance, not better, so GHC does not always do this. Frankly, I don't really understand the details behind this one. But the bottom line is, if this transformation is important to you, it's not hard to do it manually. (And if it's not important, why are you worrying about it?)
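For what it's worth, the example usually given for CSE backfiring involves sharing: if the repeated expression produces a big lazy structure, keeping one shared copy alive can cost far more memory than just recomputing it. A sketch of the idea (my own, purely illustrative):

f, g :: Integer -> Integer
-- Recomputed: each [1..n] is produced lazily and garbage-collected as it
-- is consumed, so memory use stays flat.
f n = sum [1..n] + product [1..n]
-- "Optimised" by sharing: the whole list now has to be kept in memory
-- until both traversals have finished.
g n = let xs = [1..n] in sum xs + product xs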
Case expressions
Consider the following:
foo (0:_ ) = "zero"
foo (1:_ ) = "one"
foo (_:xs) = foo xs
foo ( []) = "end"
The first three equations all check whether the list is non-empty (among other things). But checking the same thing thrice is wasteful. Fortunately, it's very easy for the compiler to optimise this into several nested case expressions. In this case, something like
foo xs =
case xs of
y:ys ->
case y of
0 -> "zero"
1 -> "one"
_ -> foo ys
[] -> "end"
This is rather less intuitive, but more efficient. Because the compiler can easily do this transformation, you don't have to worry about it. Just write your pattern matching in the most intuitive way possible; the compiler is very good at reordering and rearranging this to make it as fast as possible.
Fusion
The standard Haskell idiom for list processing is to chain together functions that take one list and produce a new list. The canonical example is
map g . map f
Unfortunately, while laziness guarantees skipping unnecessary work, all the allocations and deallocations for the intermediate list sap performance. "Fusion" or "deforestation" is where the compiler tries to eliminate these intermediate steps.
The trouble is, most of these functions are recursive. Without the recursion, it would be an elementary exercise in inlining to squish all the functions into one big code block, run the simplifier over it and produce really optimal code with no intermediate lists. But because of the recursion, that won't work.
You can use {-# RULES #-} pragmas to fix some of this. For example,
{-# RULES "map/map" forall f g xs. map f (map g xs) = map (f.g) xs #-}
Now every time GHC sees map applied to map, it squishes it into a single pass over the list, eliminating the intermediate list.
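(If you want to see whether a rule like this actually fired, GHC can tell you at compile time; something like the following, where the file name is just a placeholder,

ghc -O2 -ddump-rule-firings MyModule.hs

will print the name of each rewrite rule as it fires.)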
Trouble is, this works only for map followed by map. There are many other possibilities - map followed by filter, filter followed by map, etc. Rather than hand-code a solution for each of them, so-called "stream fusion" was invented. This is a more complicated trick, which I won't describe here.
The long and short of it is: These are all special optimisation tricks written by the programmer. GHC itself knows nothing about fusion; it's all in the list library and other container libraries. So what optimisations happen depends on how your container libraries are written (or, more realistically, which libraries you choose to use).
For example, if you work with Haskell '98 arrays, don't expect any fusion of any kind. But I understand that the vector library has extensive fusion capabilities. It's all about the libraries; the compiler just provides the RULES pragma. (Which is extremely powerful, by the way. As a library author, you can use it to rewrite client code!)
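To give a flavour of what that buys you (a sketch assuming the vector package; with -O the library's fusion rules should collapse this pipeline so that no intermediate vectors are ever allocated):

import qualified Data.Vector.Unboxed as V

pipeline :: Int -> Int
pipeline n = V.sum (V.map (*2) (V.filter even (V.enumFromTo 1 n)))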
Meta:
I agree with the people saying "code first, profile second, optimise third".
I also agree with the people saying "it is useful to have a mental model for how much cost a given design decision has".
Balance in all things, and all that...