TL;DR: Reason: RTS and STG calls. Solution: don't call trivial Haskell functions from C.
What is the reason for this…?
Disclaimer: I've never called Haskell from C. I'm familiar with C and Haskell, but I rarely intertwine both, unless I'm writing a wrapper. Now that I've lost my credibility, let's start this adventure of benchmarking, disassembling and other nifty horrors.
Benchmarking with gprof
One easy way to check what's eating all of your resources is to use gprof. We'll change your compile line slightly so that -pg
is used both by the compiler and the linker (note: I've renamed test.c to main.c and test2.c to test.c):
$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
main.c Test.hs -o test
$ ./test
$ gprof ./test
This gives us the following (flat) profile:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
16.85 2.15 2.15 scheduleWaitThread
11.78 3.65 1.50 createStrictIOThread
7.66 4.62 0.98 createThread
6.68 5.47 0.85 allocate
5.66 6.19 0.72 traverseWeakPtrList
5.34 6.87 0.68 isAlive
4.12 7.40 0.53 newBoundTask
3.06 7.79 0.39 stg_ap_p_fast
2.36 8.09 0.30 stg_ap_v_info
1.96 8.34 0.25 stg_ap_0_fast
1.85 8.57 0.24 rts_checkSchedStatus
1.81 8.80 0.23 stg_PAP_apply
1.73 9.02 0.22 rts_apply
1.73 9.24 0.22 stg_enter_info
1.65 9.45 0.21 stg_stop_thread_info
1.61 9.66 0.21 test
1.49 9.85 0.19 stg_returnToStackTop
1.49 10.04 0.19 move_STACK
1.49 10.23 0.19 stg_ap_v_fast
1.41 10.41 0.18 rts_lock
1.18 10.56 0.15 boundTaskExiting
1.10 10.70 0.14 StgRun
0.98 10.82 0.13 rts_evalIO
0.94 10.94 0.12 stg_upd_frame_info
0.79 11.04 0.10 blockedThrowTo
0.67 11.13 0.09 StgReturn
0.63 11.21 0.08 createIOThread
0.63 11.29 0.08 stg_bh_upd_frame_info
0.63 11.37 0.08 c5KU_info
0.55 11.44 0.07 stg_stk_save_n
0.51 11.50 0.07 threadPaused
0.47 11.56 0.06 dirty_TSO
0.47 11.62 0.06 ghczmprim_GHCziCString_unpackCStringzh_info
0.47 11.68 0.06 stopHeapProfTimer
0.39 11.73 0.05 stg_threadFinished
0.39 11.78 0.05 allocGroup
0.39 11.83 0.05 base_GHCziTopHandler_runNonIO1_info
0.39 11.88 0.05 stg_catchzh
0.35 11.93 0.05 freeMyTask
0.35 11.97 0.05 rts_eval_
0.31 12.01 0.04 awakenBlockedExceptionQueue
0.31 12.05 0.04 stg_ap_2_upd_info
0.27 12.09 0.04 s5q4_info
0.24 12.12 0.03 markStableTables
0.24 12.15 0.03 rts_getSchedStatus
0.24 12.18 0.03 s5q3_info
0.24 12.21 0.03 scavenge_stack
0.24 12.24 0.03 stg_ap_7_upd_info
0.24 12.27 0.03 stg_ap_n_fast
0.24 12.30 0.03 stg_gc_noregs
0.20 12.32 0.03 base_GHCziTopHandler_runIO1_info
0.20 12.35 0.03 stat_exit
0.16 12.37 0.02 GarbageCollect
0.16 12.39 0.02 dirty_STACK
0.16 12.41 0.02 freeGcThreads
0.16 12.43 0.02 rts_mkString
0.16 12.45 0.02 scavenge_capability_mut_lists
0.16 12.47 0.02 startProfTimer
0.16 12.49 0.02 stg_PAP_info
0.16 12.51 0.02 stg_ap_stk_p
0.16 12.53 0.02 stg_catch_info
0.16 12.55 0.02 stg_killMyself
0.16 12.57 0.02 stg_marked_upd_frame_info
0.12 12.58 0.02 interruptAllCapabilities
0.12 12.60 0.02 scheduleThreadOn
0.12 12.61 0.02 waitForReturnCapability
0.08 12.62 0.01 exitStorage
0.08 12.63 0.01 freeWSDeque
0.08 12.64 0.01 gcStableTables
0.08 12.65 0.01 resetTerminalSettings
0.08 12.66 0.01 resizeNurseriesEach
0.08 12.67 0.01 scavenge_loop
0.08 12.68 0.01 split_free_block
0.08 12.69 0.01 startHeapProfTimer
0.08 12.70 0.01 stg_MVAR_TSO_QUEUE_info
0.08 12.71 0.01 stg_forceIO_info
0.08 12.72 0.01 zero_static_object_list
0.04 12.73 0.01 frame_dummy
0.04 12.73 0.01 rts_evalLazyIO_
0.00 12.73 0.00 1 0.00 0.00 stginit_export_Test_zdfstableZZC0ZZCmainZZCTestZZCtest
Woah, that's a bunch of functions getting called. How does this compare to your C version?
$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
main.c TestDummy.hs test.c -o test_c
$ ./test_c
$ gprof ./test_c
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
75.00 0.05 0.05 test
25.00 0.06 0.02 frame_dummy
That's a lot simpler. But why?
What's happening behind?
Maybe you've wondered why test
even showed up in the previous profile. Well, gprof itself adds some overhead, as can be seen with objdump
:
$ objdump -D ./test_c | grep -A5 "<test>:"
0000000000405630 <test>:
405630: 55 push %rbp
405631: 48 89 e5 mov %rsp,%rbp
405634: e8 f7 d4 ff ff callq 402b30 <mcount@plt>
405639: 5d pop %rbp
40563a: c3 retq
The call of mcount
is added by gcc. So for the next part you want to remove the -pg
options. Let's check the disassembled test
routine in C first:
$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \
main.c TestDummy.hs test.c -o test_c
$ objdump -D ./test_c | grep -A2 "<test>:"
0000000000405510 <test>:
405510: f3 c3 repz retq
The repz retq
is actually some optimisation magic, but in this case you can think of it as a (mostly) no-op return.
How does this compare to the Haskell version?
$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \
main.c Test.hs -o test_hs
$ objdump -D ./Test.o | grep -A18 "<test>:"
0000000000405520 <test>:
405520: 48 83 ec 18 sub $0x18,%rsp
405524: e8 f7 3a 05 00 callq 459020 <rts_lock>
405529: ba 58 24 6b 00 mov $0x6b2458,%edx
40552e: be 80 28 6b 00 mov $0x6b2880,%esi
405533: 48 89 c7 mov %rax,%rdi
405536: 48 89 04 24 mov %rax,(%rsp)
40553a: e8 51 36 05 00 callq 458b90 <rts_apply>
40553f: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
405544: 48 89 c6 mov %rax,%rsi
405547: 48 89 e7 mov %rsp,%rdi
40554a: e8 01 39 05 00 callq 458e50 <rts_evalIO>
40554f: 48 8b 34 24 mov (%rsp),%rsi
405553: bf 64 57 48 00 mov $0x485764,%edi
405558: e8 23 3a 05 00 callq 458f80 <rts_checkSchedStatus>
40555d: 48 8b 3c 24 mov (%rsp),%rdi
405561: e8 0a 3b 05 00 callq 459070 <rts_unlock>
405566: 48 83 c4 18 add $0x18,%rsp
40556a: c3 retq
40556b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
405570: d8 ce fmul %st(6),%st
This looks rather different. Indeed, the RTS functions seem suspicious. Let's have a look at them:
rts_checkSchedStatus
just checks whether the status is fine, and exits otherwise. The Success
path doesn't have much overhead, so this function isn't really the culprit.
rts_unlock
and rts_lock
basically claim and release a capability (a virtual CPU). They call newBoundTask
and boundTaskExiting
, which take some time (see profile above).
rts_apply
calls allocate
, one of the most often used functions throughout the program (which isn't really surprising, with Haskell being garbage collected).
rts_evalIO
finally creates the actual thread and waits for its completion. So we can estimate that rts_evalIO
alone takes about 27%.
So we found all the functions that are taking all the time. The STG and the RTS take all the credit for the overhead of 150ns per call.
…and is there a way to mitigate this slowdown?
Well, your test
is basically a no-op. You're calling it 100000000 times, with a total runtime of 15s. Compared to the C version, that's an overhead of ~149ns per call.
The solution is quite simple: don't use Haskell functions in your C world for trivial tasks. Use the right tool for the right situation. After all, you don't use the GMP library if you need to add two numbers that are guaranteed to be less than 10.
Apart from this paradigmatic solution: no. The assembly shown above is what's created by GHC, and it's not possible to create a variant without RTS calls at the moment.
test
symbol produced by the Haskell code could be both educational and a great blog post. – Thomas M. DuBuisson-threaded
make any difference either way? – MathematicalOrchid