3 votes

Disclaimer: I know little about the GHC compilation pipeline, but I hope to learn more about it through this post, for example whether the imperative vs. functional distinction is relevant to how code gets compiled.

As you know, loop unrolling reduces the number of iterations of a loop by duplicating the code inside it. This improves performance since it reduces the number of jumps (and the penalties associated with them) and, AFAIR, creates bigger basic blocks, leaving room for better register-renaming optimization.

I was wondering: could there be an equivalent of loop unrolling for functional programming? Could we 'unroll' a function, i.e. open up/expand its definition, first to reduce the number of calls to that function and/or to create bigger chunks of code, and then leave room for further code-rewriting optimizations (like register renaming, or some FP equivalent)?

Something that would 'unroll' or 'expand' a function definition, using, for example, function evaluation (maybe mixed with some other tactic), in order to trade space for time.

An example of what I had in mind:

 map1 _ [] = []
 map1 f (x:xs) = (f x): map1 f xs

Would unroll to

map2 _ [] = []
map2 f (x:x1:xs) = (f x):(f x1):map2 f xs
map2 f (x:xs) = (f x): map2 f xs

Once more:

map4 _ [] = []
map4 f (x:x1:x2:x3:xs) = (f x):(f x1):(f x2):(f x3):map4 f xs
map4 f (x:x1:x2:xs) = (f x):(f x1):(f x2):map4 f xs
map4 f (x:x1:xs) = (f x):(f x1):map4 f xs
map4 f (x:xs) = (f x): map4 f xs
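As a sanity check (my addition, not in the original post), the unrolled versions must agree with map for every list length, since the fall-through clauses handle lists whose length is not a multiple of the unroll factor. A minimal self-contained test, using primed names to avoid clashing with the definitions above:

```haskell
-- Check that the unrolled maps agree with map on lists of
-- length 0..10, covering every remainder modulo 2 and 4.
map2' :: (a -> b) -> [a] -> [b]
map2' _ [] = []
map2' f (x:x1:xs) = f x : f x1 : map2' f xs
map2' f (x:xs)    = f x : map2' f xs

map4' :: (a -> b) -> [a] -> [b]
map4' _ [] = []
map4' f (x:x1:x2:x3:xs) = f x : f x1 : f x2 : f x3 : map4' f xs
map4' f (x:x1:x2:xs)    = f x : f x1 : f x2 : map4' f xs
map4' f (x:x1:xs)       = f x : f x1 : map4' f xs
map4' f (x:xs)          = f x : map4' f xs

main :: IO ()
main = print $ and
  [ map2' succ xs == map succ xs && map4' succ xs == map succ xs
  | n <- [0 .. 10 :: Int], let xs = [1 .. n] ]
```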

Two opposing effects are at play: the extra cases of map4 (and the consequent pattern tests on the list) could degrade performance, or the reduced number of calls to map4 could improve it. Maybe this could also shave off some of the constant overhead introduced by lazy evaluation?

Well, that doesn't seem too hard to write a test for, so after rolling this out with criterion, this is what I got:


Problem size 5*10^6

map  105.4 ms
map2 93.34 ms
map4 89.79 ms

Problem size 1*10^7

map  216.3 ms
map2 186.8 ms
map4 180.1 ms

Problem size 5*10^7

map  1050 ms
map2 913.7 ms
map4 899.8 ms

Well, it seems that unrolling had some effect^1! map4 appears to be 16% faster.

Time for the questions then:

  1. Has this been discussed before? Is something like this already implemented?
  2. Is it really the reduced number of evaluations of map4 that improves speed?
  3. Can this be automated?
  4. Could I evaluate by chunks? I.e.: if (f x) is fully evaluated, fully evaluate everything up to (f x3).
  5. Are there other forms in which this sort of unrolling could come into play?
  6. How much inflation in function size could this lead to?
  7. Any shortcomings, or reasons why this is not a good idea?

1: I've also unrolled fib, since this sort of optimization would also kick in there, but the performance gain is cheating: it merely speeds up a (very) bad algorithm.

Do you mean inlining? Because it sounds like you mean inlining. In which case, yes, GHC inlines aggressively, as do most high-performance compilers, regardless of language. - Daniel Gratzer
I don't think inlining does what I'm talking about, although it is similar. AFAIU, inlining would inline f in (f x), improving speed by removing a function call at the expense of code size. But IMO it wouldn't inline map in (f x):(map f xs), as its behaviour is very different from f's. Also, my test code is compiled with -O, so even if I'm mistaken about inlining behaviour, it's not kicking in. - MdxBhmt
I don't believe this makes a difference. Compiling with -O2 causes them all to give the same results. Unrolling doesn't reduce the number of thunks (if map f (x:xs) produces one, so does f x : map f xs). I'm not sure what else could be affected. - user2407038

3 Answers

6 votes

Did you compile with optimizations? For me, with -O2, there's not really a difference between these snippets: map1, map2, and map4 ran in 279, 267, and 285 ms, respectively (and for comparison, map itself ran in 278 ms). So that looks like measurement noise rather than an improvement to me.

That said, you might like to look at this GHC plugin which seems to be about loop unrolling.

It's sad but pretty true that pure functional languages and imperative languages tend to have very different optimization techniques. For example, you might want to look at stream fusion and deforestation -- two techniques that are pretty neat but don't translate very well to imperative languages.
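To make the deforestation idea concrete (my own sketch, not part of the answer): the goal is to eliminate the intermediate list between a producer and a consumer. GHC's foldr/build fusion derives this automatically for pipelines like sum . map f; done by hand it looks like this:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Unfused pipeline: 'map' produces an intermediate list of cons
-- cells that 'sum' immediately consumes and discards.
sumSquares :: Int -> Int
sumSquares n = sum (map (\x -> x * x) [1 .. n])

-- Manually deforested version: a single tail-recursive loop with a
-- strict accumulator and no intermediate list at all. GHC's list
-- fusion aims to reach this form automatically.
sumSquares' :: Int -> Int
sumSquares' n = go 1 0
  where
    go i !acc | i > n     = acc
              | otherwise = go (i + 1) (acc + i * i)

main :: IO ()
main = print (sumSquares 100, sumSquares' 100)
```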

And as for "Any shortcomings, or reasons why this is not a good idea?", well, I can think of one right off the bat:

*Main> map succ (1:undefined)
[2*** Exception: Prelude.undefined
*Main> map4 succ (1:undefined)
*** Exception: Prelude.undefined

In many situations, making a function more strict in order to improve performance is fine, but here the performance win isn't that clear to me and map is often used in laziness-reliant ways.
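To spell out that laziness point (my own runnable example, reusing the question's map4): map forces only as much of the list's spine as the consumer demands, while map4 has to match four cons cells before producing its first element:

```haskell
import Control.Exception (SomeException, evaluate, try)

-- map4 as defined in the question
map4 :: (a -> b) -> [a] -> [b]
map4 _ [] = []
map4 f (x:x1:x2:x3:xs) = f x : f x1 : f x2 : f x3 : map4 f xs
map4 f (x:x1:x2:xs)    = f x : f x1 : f x2 : map4 f xs
map4 f (x:x1:xs)       = f x : f x1 : map4 f xs
map4 f (x:xs)          = f x : map4 f xs

main :: IO ()
main = do
  -- a list whose spine is defined only for the first two cells
  let xs = 1 : 2 : error "tail not computed" :: [Int]
  -- map forces only what 'take 2' demands: this succeeds
  print (take 2 (map succ xs))
  -- map4 must match four cons cells before emitting anything,
  -- so even asking for the head hits the error
  r <- try (evaluate (head (map4 succ xs)))
  case (r :: Either SomeException Int) of
    Left _  -> putStrLn "map4: exception"
    Right v -> print v
```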

3 votes

Along with the ghc unrolling plugin already mentioned, there is a page on the GHC trac which discusses peeling/unrolling. The "Open Issues" and "References" sections are particularly revealing sources of further research material.

1 vote

Loop unrolling is quite a blunt weapon. I'd never want your map example to be unrolled, for example: it is entirely dominated by memory allocation of the returned list and of the thunks in its cells. It doesn't matter whether or not the register allocator has more to chew on. (Whether or not to unroll a fold like foldl' is perhaps a different question.)

GHC could achieve loop unrolling by inlining recursive functions, but it tries hard not to: in fact, it will never inline the "loop-breaker"(s) of a recursive group, because otherwise there is no guarantee that inlining terminates at all. See Section 4 of "Secrets of the GHC inliner".
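A small sketch of what the loop-breaker rule means in practice (my own illustration): if you write map in wrapper/worker style, GHC may inline the non-recursive wrapper at call sites, specialising it to the known function, but the self-recursive worker is chosen as the loop-breaker and is never inlined, which is exactly why its body never gets duplicated (unrolled):

```haskell
-- Wrapper/worker style. The wrapper 'mapW f' is not recursive, so
-- GHC may inline it at call sites, exposing the concrete 'f' to the
-- optimiser. The worker 'go' is self-recursive: GHC picks it as the
-- loop-breaker of the recursive group and will not inline it.
mapW :: (a -> b) -> [a] -> [b]
mapW f = go
  where
    go []     = []
    go (x:xs) = f x : go xs

main :: IO ()
main = print (mapW (* 2) [1, 2, 3 :: Int])
```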

GHC does apply a limited form of loop peeling (or rather, partial redundancy elimination) in its LiberateCase pass (run with -O2):

f :: (Int, Int) -> Int
f p = g [0..snd p]
  where
    g :: [Int] -> Int
    g [] = 0
    g (x:xs) = fst p + x + g xs

Here, GHC will peel one iteration of the loop to get the fst p projection out of the loop and reference an unboxed Int# instead. Core:

Lib.$wf
  = \ (ww_sNF :: Int) (ww1_sNJ :: GHC.Prim.Int#) ->
      case GHC.Enum.eftInt 0# ww1_sNJ of {
        [] -> 0#;
        : x_atE xs_atF ->
          case ww_sNF of { GHC.Types.I# x1_aLW ->
          case x_atE of { GHC.Types.I# y_aLZ ->
          letrec {
            $wg_sNB [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
              :: [Int] -> GHC.Prim.Int#
            [LclId, Arity=1, Str=<S,1*U>, Unf=OtherCon []]
            $wg_sNB
              = \ (w_sNx :: [Int]) ->
                  case w_sNx of {
                    [] -> 0#;
                    : x2_Xud xs1_Xuf ->
                      case x2_Xud of { GHC.Types.I# y1_XMG ->
                      case $wg_sNB xs1_Xuf of ww2_sNA { __DEFAULT ->
                      GHC.Prim.+# (GHC.Prim.+# x1_aLW y1_XMG) ww2_sNA
                      }
                      }
                  }; } in
          case $wg_sNB xs_atF of ww2_sNA { __DEFAULT ->
          GHC.Prim.+# (GHC.Prim.+# x1_aLW y_aLZ) ww2_sNA
          }
          }
          }
      }
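For readers who don't speak Core: the transformation above corresponds roughly to the following hand-peeled source version (my own reconstruction, not compiler output), in which the fst p projection happens exactly once, outside the recursive loop:

```haskell
-- Hand-peeled equivalent of what LiberateCase does to 'f': the
-- first iteration is split off so that 'fst p' is projected once,
-- and the inner loop then closes over the already-extracted value.
f' :: (Int, Int) -> Int
f' p = case [0 .. snd p] of
  []       -> 0
  (x : xs) ->
    let a = fst p                 -- projected once, outside the loop
        g :: [Int] -> Int
        g []       = 0
        g (y : ys) = a + y + g ys
    in  a + x + g xs

main :: IO ()
main = print (f' (10, 3))   -- same result as the original f
```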