17
votes

I've noticed a significant overhead calling Haskell functions in C, much larger than the overhead of a native C function call. To distill the issue to its essence, I wrote a program that just initializes the Haskell runtime, runs a loop that calls an empty function 100,000,000 times, and returns.

With the function inlined, the program takes 0.003s. Calling an empty function written in C takes 0.18s. Calling an empty function written in Haskell takes 15.5s. (Strangely, if I compile the empty Haskell file separately before linking, it takes a few more seconds. Sub-question: why is this?)

So it looks like there's about a 100-fold slowdown between calling a C function and calling a Haskell function. What is the reason for this, and is there a way to mitigate this slowdown?

Code

EDIT: I've discovered a version of this test in the NoFib benchmark suite, callback002. There's a nice blog post by Edward Z. Yang mentioning this test in the context of the GHC scheduler. I'm still trying to grok this blog post along with Zeta's very nice answer. I'm not yet convinced that there's not a way to do this faster!

To compile the "slow" Haskell version, run

ghc -no-hs-main -O2 -optc-O3 test.c Test.hs -o test

To compile the "fast" C version, run

ghc -no-hs-main -O2 -optc-O3 test.c test2.c TestDummy.hs -o test

test.c:

#include "HsFFI.h"
extern void __stginit_Test(void);

extern void test();

int main(int argc, char *argv[]) {
  hs_init(&argc, &argv);
  hs_add_root(__stginit_Test);
  int i;
  for (i = 0; i < 100000000; i++) {
    test();
  }
  hs_exit();
  return 0;
}

test2.c:

void test() {
}

Test.hs:

{-# LANGUAGE ForeignFunctionInterface #-}

module Test where

foreign export ccall test :: ()

test :: ()
test = ()

TestDummy.hs:

module Test where
1
There's quite a bit of work needed to make sure you can allocate memory etc. in the Haskell function.augustss
Disassembling the test symbol produced by the Haskell code could be both educational and a great blog post.Thomas M. DuBuisson
Out of perverse curiosity... does compiling with -threaded make any difference either way?MathematicalOrchid
@MathematicalOrchid: Nope. Maybe in the linked RTS variant, but not in the object code.Zeta
@Zeta I was just wondering if it affects the timings. I know the two RTS ways differ in how they handle foreign calls... That said, I wouldn't be massively surprised if it makes no measurable difference.MathematicalOrchid

1 Answers

18
votes

TL;DR: Reason: RTS and STG calls. Solution: don't call trivial Haskell functions from C.


What is the reason for this…?

Disclaimer: I've never called Haskell from C. I'm familiar with C and Haskell, but I rarely intertwine both, unless I'm writing a wrapper. Now that I've lost my credibility, let's start this adventure of benchmarking, disassembling and other nifty horrors.

Benchmarking with gprof

One easy way to check what's eating all of your resources is to use gprof. We'll change your compile line slightly so that -pg is used both by the compiler and the linker (note: I've renamed test.c to main.c and test2.c to test.c):

$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
    main.c Test.hs -o test
$ ./test
$ gprof ./test

This gives us the following (flat) profile:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 16.85      2.15     2.15                             scheduleWaitThread
 11.78      3.65     1.50                             createStrictIOThread
  7.66      4.62     0.98                             createThread
  6.68      5.47     0.85                             allocate
  5.66      6.19     0.72                             traverseWeakPtrList
  5.34      6.87     0.68                             isAlive
  4.12      7.40     0.53                             newBoundTask
  3.06      7.79     0.39                             stg_ap_p_fast
  2.36      8.09     0.30                             stg_ap_v_info
  1.96      8.34     0.25                             stg_ap_0_fast
  1.85      8.57     0.24                             rts_checkSchedStatus
  1.81      8.80     0.23                             stg_PAP_apply
  1.73      9.02     0.22                             rts_apply
  1.73      9.24     0.22                             stg_enter_info
  1.65      9.45     0.21                             stg_stop_thread_info
  1.61      9.66     0.21                             test
  1.49      9.85     0.19                             stg_returnToStackTop
  1.49     10.04     0.19                             move_STACK
  1.49     10.23     0.19                             stg_ap_v_fast
  1.41     10.41     0.18                             rts_lock
  1.18     10.56     0.15                             boundTaskExiting
  1.10     10.70     0.14                             StgRun
  0.98     10.82     0.13                             rts_evalIO
  0.94     10.94     0.12                             stg_upd_frame_info
  0.79     11.04     0.10                             blockedThrowTo
  0.67     11.13     0.09                             StgReturn
  0.63     11.21     0.08                             createIOThread
  0.63     11.29     0.08                             stg_bh_upd_frame_info
  0.63     11.37     0.08                             c5KU_info
  0.55     11.44     0.07                             stg_stk_save_n
  0.51     11.50     0.07                             threadPaused
  0.47     11.56     0.06                             dirty_TSO
  0.47     11.62     0.06                             ghczmprim_GHCziCString_unpackCStringzh_info
  0.47     11.68     0.06                             stopHeapProfTimer
  0.39     11.73     0.05                             stg_threadFinished
  0.39     11.78     0.05                             allocGroup
  0.39     11.83     0.05                             base_GHCziTopHandler_runNonIO1_info
  0.39     11.88     0.05                             stg_catchzh
  0.35     11.93     0.05                             freeMyTask
  0.35     11.97     0.05                             rts_eval_
  0.31     12.01     0.04                             awakenBlockedExceptionQueue
  0.31     12.05     0.04                             stg_ap_2_upd_info
  0.27     12.09     0.04                             s5q4_info
  0.24     12.12     0.03                             markStableTables
  0.24     12.15     0.03                             rts_getSchedStatus
  0.24     12.18     0.03                             s5q3_info
  0.24     12.21     0.03                             scavenge_stack
  0.24     12.24     0.03                             stg_ap_7_upd_info
  0.24     12.27     0.03                             stg_ap_n_fast
  0.24     12.30     0.03                             stg_gc_noregs
  0.20     12.32     0.03                             base_GHCziTopHandler_runIO1_info
  0.20     12.35     0.03                             stat_exit
  0.16     12.37     0.02                             GarbageCollect
  0.16     12.39     0.02                             dirty_STACK
  0.16     12.41     0.02                             freeGcThreads
  0.16     12.43     0.02                             rts_mkString
  0.16     12.45     0.02                             scavenge_capability_mut_lists
  0.16     12.47     0.02                             startProfTimer
  0.16     12.49     0.02                             stg_PAP_info
  0.16     12.51     0.02                             stg_ap_stk_p
  0.16     12.53     0.02                             stg_catch_info
  0.16     12.55     0.02                             stg_killMyself
  0.16     12.57     0.02                             stg_marked_upd_frame_info
  0.12     12.58     0.02                             interruptAllCapabilities
  0.12     12.60     0.02                             scheduleThreadOn
  0.12     12.61     0.02                             waitForReturnCapability
  0.08     12.62     0.01                             exitStorage
  0.08     12.63     0.01                             freeWSDeque
  0.08     12.64     0.01                             gcStableTables
  0.08     12.65     0.01                             resetTerminalSettings
  0.08     12.66     0.01                             resizeNurseriesEach
  0.08     12.67     0.01                             scavenge_loop
  0.08     12.68     0.01                             split_free_block
  0.08     12.69     0.01                             startHeapProfTimer
  0.08     12.70     0.01                             stg_MVAR_TSO_QUEUE_info
  0.08     12.71     0.01                             stg_forceIO_info
  0.08     12.72     0.01                             zero_static_object_list
  0.04     12.73     0.01                             frame_dummy
  0.04     12.73     0.01                             rts_evalLazyIO_
  0.00     12.73     0.00        1     0.00     0.00  stginit_export_Test_zdfstableZZC0ZZCmainZZCTestZZCtest

Woah, that's a bunch of functions getting called. How does this compare to your C version?

$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
    main.c TestDummy.hs test.c -o test_c
$ ./test_c
$ gprof ./test_c
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 75.00      0.05     0.05                             test
 25.00      0.06     0.02                             frame_dummy

That's a lot simpler. But why?

What's happening behind?

Maybe you've wondered why test even showed up in the previous profile. Well, gprof itself adds some overhead, as can be seen with objdump:

$ objdump -D ./test_c | grep -A5 "<test>:"
0000000000405630 <test>:
  405630:   55                      push   %rbp
  405631:   48 89 e5                mov    %rsp,%rbp
  405634:   e8 f7 d4 ff ff          callq  402b30 <mcount@plt>
  405639:   5d                      pop    %rbp
  40563a:   c3                      retq   

The call of mcount is added by gcc. So for the next part you want to remove the -pg options. Let's check the disassembled test routine in C first:

$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \ 
    main.c TestDummy.hs test.c -o test_c
$ objdump -D ./test_c | grep -A2 "<test>:"
0000000000405510 <test>:
  405510:   f3 c3                   repz retq

The repz retq is actually some optimisation magic, but in this case you can think of it as a (mostly) no-op return.

How does this compare to the Haskell version?

$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \ 
    main.c Test.hs -o test_hs    
$ objdump -D ./Test.o | grep -A18 "<test>:"
0000000000405520 <test>:
  405520:   48 83 ec 18             sub    $0x18,%rsp
  405524:   e8 f7 3a 05 00          callq  459020 <rts_lock>
  405529:   ba 58 24 6b 00          mov    $0x6b2458,%edx
  40552e:   be 80 28 6b 00          mov    $0x6b2880,%esi
  405533:   48 89 c7                mov    %rax,%rdi
  405536:   48 89 04 24             mov    %rax,(%rsp)
  40553a:   e8 51 36 05 00          callq  458b90 <rts_apply>
  40553f:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  405544:   48 89 c6                mov    %rax,%rsi
  405547:   48 89 e7                mov    %rsp,%rdi
  40554a:   e8 01 39 05 00          callq  458e50 <rts_evalIO>
  40554f:   48 8b 34 24             mov    (%rsp),%rsi
  405553:   bf 64 57 48 00          mov    $0x485764,%edi
  405558:   e8 23 3a 05 00          callq  458f80 <rts_checkSchedStatus>
  40555d:   48 8b 3c 24             mov    (%rsp),%rdi
  405561:   e8 0a 3b 05 00          callq  459070 <rts_unlock>
  405566:   48 83 c4 18             add    $0x18,%rsp
  40556a:   c3                      retq   
  40556b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  405570:   d8 ce                   fmul   %st(6),%st

This looks rather different. Indeed, the RTS functions seem suspicious. Let's have a look at them:

  • rts_checkSchedStatus just checks whether the status is fine, and exits otherwise. The Success path doesn't have much overhead, so this function isn't really the culprit.
  • rts_unlock and rts_lock basically claim and release a capability (a virtual CPU). They call newBoundTask and boundTaskExiting, which take some time (see profile above).
  • rts_apply calls allocate, one of the most often used functions throughout the program (which isn't really surprising, with Haskell being garbage collected).
  • rts_evalIO finally creates the actual thread and waits for its completion. So we can estimate that rts_evalIO alone takes about 27%.

So we found all the functions that are taking all the time. The STG and the RTS take all the credit for the overhead of 150ns per call.

…and is there a way to mitigate this slowdown?

Well, your test is basically a no-op. You're calling it 100000000 times, with a total runtime of 15s. Compared to the C version, that's an overhead of ~149ns per call.

The solution is quite simple: don't use Haskell functions in your C world for trivial tasks. Use the right tool for the right situation. After all, you don't use the GMP library if you need to add two numbers that are guaranteed to be less than 10.

Apart from this paradigmatic solution: no. The assembly shown above is what's created by GHC, and it's not possible to create a variant without RTS calls at the moment.