48
votes

I have some high-performance Haskell code - the inner loop compiles down to 6 assembly instructions. Modifying the inner loop to be less efficient doesn't have any noticeable impact on performance, suggesting the inner loop is not the bottleneck. However, when I turn on profiling, the assembly generated for the inner loop becomes dramatically less efficient, and the profiler reports that the inner loop takes 85% of the time.

I suspect something is unnecessarily slow, but when I use profiling to find out what, the profiling itself makes the inner loop slow enough that it dominates. What techniques can I use to see where the time is going? A sampling profiler would be great, if one existed for Haskell.
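(For reference, one way to see what the loop compiles to is to ask GHC to dump the generated assembly; a minimal sketch, assuming the hot module is called Main.hs - a hypothetical name:)

    # Dump the native code generator's output to a file next to the module
    ghc -O2 -fforce-recomp -ddump-asm -ddump-to-file Main.hs
    # The assembly ends up in Main.dump-asm; search it for the loop's symbol
    less Main.dump-asm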

Maybe if you add cost centres manually, you could avoid adding a cost centre to that loop, so it won't be slower? Not sure if that works though. – bennofs
Good idea, but it doesn't seem to work. Even with no cost centres at all, just adding -prof puts a call to enterFunCCS in the inner loop, plus about 5 other additional instructions (but the call will be the one that kills it). – Neil Mitchell
I have some hacks for doing sampling. Basically you just link with the profiling startup code. The problem is that the generated .prof file makes gprof die, probably because gprof has some assumptions about the symbols. But a hacked-up gprof that knows about GHC symbols should be able to work. – augustss
Would it be possible to comment out the loop? That way you could profile the rest of the code without the loop messing up the profile. – Pedro Rodrigues
Maybe you could try "ticky-ticky" profiling: ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky (I have no experience doing this, just stumbled over it on the GHC wiki; see the sketch below). – bennofs
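A minimal sketch of what trying ticky-ticky profiling looks like, assuming GHC's -ticky flag and the RTS -r option for the report file (Main.hs and the output name are hypothetical):

    # Compile and link with ticky counters enabled (no cost centres are added)
    ghc -O2 -rtsopts -ticky Main.hs -o main
    # Run as usual; the per-closure counters are written to the named report file
    ./main +RTS -rmain.ticky -RTS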

1 Answer

24
votes

You can use Linux perf events: https://ghc.haskell.org/trac/ghc/wiki/Debugging/LowLevelProfiling/Perf
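A minimal sketch of the workflow, assuming perf is installed and using the queens binary from the output below (a plain -O2 build; no profiling flags are needed):

    # Sample the running program with hardware performance counters
    perf record ./queens
    # Browse which symbols the samples fell into
    perf report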

This will give you an output that looks like:

# Samples: 9161149923
#
# Overhead  Command      Shared Object  Symbol
# ........  .......  .................  ......
#
    30.65%   queens  queens             [.] s1ql_info
    18.67%   queens  queens             [.] s1qj_info
    12.17%   queens  queens             [.] s1qi_info
     9.94%   queens  queens             [.] s1o9_info
     5.85%   queens  queens             [.] r1nI_info
     5.33%   queens  queens             [.] s1sF_info
     5.18%   queens  queens             [.] s1sG_info
     3.69%   queens  queens             [.] s1oP_info
     1.68%   queens  queens             [.] stg_upd_frame_info
     0.88%   queens  queens             [.] stg_ap_2_upd_info
     0.62%   queens  queens             [.] s1sE_info
     0.56%   queens  [kernel]           [k] read_hpet
     0.39%   queens  queens             [.] stg_ap_p_info
     0.35%    :2030             f76beb  [.] 0x00000000f76beb
     0.31%   queens  queens             [.] s1oD_info
     0.28%  swapper  [kernel]           [k] mwait_idle_with_hints
     0.25%   queens  queens             [.] __stg_gc_enter_1
     0.23%   queens  queens             [.] evacuate
     0.18%  swapper  [kernel]           [k] read_hpet
     0.12%   queens  queens             [.] scavenge_block

If you save the Core as you compile, you can map these symbols back to the functions in the Core.
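A minimal sketch of saving the intermediate output during compilation, assuming GHC's dump flags (the correspondence between dump binders and the _info symbols above is approximate; -ddump-stg tends to match the final symbol names more closely than -ddump-simpl):

    # Write the optimised Core and STG to Main.dump-simpl / Main.dump-stg
    ghc -O2 -ddump-simpl -ddump-stg -ddump-to-file Main.hs
    # Look up a hot symbol from the perf report, minus its _info suffix
    grep -n "s1ql" Main.dump-simpl Main.dump-stg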

It's a bit painful, but it gives you more trustworthy results.

There is some work afoot to do this automatically.