Runtime performance degradation for C FFI Callback when pthreads are enabled

votes

I am curious about the behavior of GHC runtime with threaded option in case when C FFI calls back Haskell function. I wrote code to measure overhead of a basic function callback (below). While the function callback overhead has already been discussed before, I am curious about the sharp increase in total time I observed when multi-threading is enabled in C code (even when total number of function calls to Haskell remain same). In my test, I called Haskell function f 5M times using two scenarios (GHC 7.0.4, RHEL, 12-core box, runtime options below after the code):

Single thread in C create_threads function: call f 5M times - Total time 1.32s
5 threads in C create_threads function: each thread calls f 1M times - so, total is still 5M - Total time 7.79s

Code below - Haskell code below is for single-threaded C callback - comments explain how to update it for 5-thread testing:

t.hs:

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import Control.Monad (mapM, mapM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)

f :: CInt -> ()
f x = ()

-- "wrapper" import is a converter for converting a Haskell function to a foreign function pointer
foreign import ccall "wrapper"
  wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
  createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO()

main = do
  -- set threads=[1..5], l=1000000 for multi-threaded FFI callback testing
  let threads = [1..1]
      l = 5000000
      vl = SV.replicate (length threads) (fromIntegral l) -- make a vector of l
  lf <- mapM (\x -> wrap f ) threads -- wrap f into a funPtr and create a list
  let vf = SV.fromList lf -- create vector of FunPtr to f
  -- pass vector of function pointer to f, and vector of l to create_threads
  -- create_threads will spawn threads (equal to length of threads list)
  -- each pthread will call back f l times - then we can check the overhead
  SV.unsafeWith vf $ \x ->
    SV.unsafeWith vl $ \y -> createThreads x y (fromIntegral $ SV.length vl)
  SV.mapM_ freeHaskellFunPtr vf

mt.h:

#include <pthread.h>
#include <stdio.h>

typedef void(*FunctionPtr)(int);

/** Struct for passing argument to thread
**
**/
typedef struct threadArgs{
   int  threadId;
   FunctionPtr fn;
   int length;
} threadArgs;


/* This is our thread function.  It is like main(), but for a thread*/
void *threadFunc(void *arg);
void create_threads(FunctionPtr*,int*,int);

mt.c:

#include "mt.h"


/* This is our thread function.  It is like main(), but for a thread*/
void *threadFunc(void *arg)
{
  FunctionPtr fn;
  threadArgs args = *(threadArgs*) arg;
  int id = args.threadId;
  int length = args.length;
  fn = args.fn;
  int i;
  for (i=0; i < length;){
    fn(i++); //call haskell function
  }
}

void create_threads(FunctionPtr* fp, int* length, int numThreads )
{
  pthread_t pth[numThreads];  // this is our thread identifier
  threadArgs args[numThreads];
  int t;
  for (t=0; t < numThreads;){
    args[t].threadId = t;
    args[t].fn = *(fp + t);
    args[t].length = *(length + t);
    pthread_create(&pth[t],NULL,threadFunc,&args[t]);
    t++;
  }

  for (t=0; t < numThreads;t++){
    pthread_join(pth[t],NULL);
  }
  printf("All threads terminated\n");
}

Compilation (GHC 7.0.4, gcc 4.4.3 in case it is used by ghc):

 $ ghc -O2 t.hs mt.c -lpthread -threaded -rtsopts -optc-O2

Running with 1 thread in create_threads (the code above will do that) - I turned off parallel gc for testing:

$ ./t +RTS -s -N5 -g1
INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.04s  (  1.05s elapsed)
  GC    time    0.28s  (  0.28s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.32s  (  1.34s elapsed)

  %GC time      21.1%  (21.2% elapsed)

Running with 5 threads (see first comment in main function of t.hs above on how to edit it for 5 threads):

$ ./t +RTS -s -N5 -g1
INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.42s  (  2.27s elapsed)
  GC    time    0.36s  (  0.37s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.79s  (  2.63s elapsed)

  %GC time       4.7%  (13.9% elapsed)

I will appreciate insight into why the performance degrades with multiple pthreads in create_threads. I first suspected parallel GC but I turned it off for testing above. The MUT time too goes up sharply for multiple pthreads, given the same runtime options. So, it is not just GC.

Also, are there any improvements in GHC 7.4.1 for this kind of scenario?

I don't plan to call back Haskell from FFI that often, but it helps to understand the above issue, when designing Haskell/C mult-threaded library interaction.

haskellconcurrencyffi

I get a much smaller slowdown with 7.2.2, Total time 1.42s (1.42s elapsed) for single threaded vs. 2.58s (1.86s elapsed) with four threads (since I have only 2 physical cores with 4 Threads, I deemed it pointless to ask for five threads). So it will probably be better in 7.4.1. – Daniel Fischer

@DanielFischer, thanks for the pointer on 7.2.2 performance. May be I should download and compile 7.4.1RC on RHEL to see how it performs. That is quite time-consuming endeavor though. – Sal

I believe they have pre-built binaries also for the release candidates. That wouldn't be so time-consuming, I think. Or don't the vanilla binaries work on RHEL? – Daniel Fischer

@DanielFischer, vanilla binary won't work on RHEL5 because of older glibc version than the one binary is compiled with. – Sal

1 Answers

votes

I believe the key question here is, how does the GHC runtime schedule C callbacks into Haskell? Although I don't know for certain, my suspicion is that all C callbacks are handled by the Haskell thread that originally made the foreign call, at least up to ghc-7.2.1 (which I'm using).

This would explain the large slowdown you (and I) see when moving from 1 thread to 5. If the five threads are all calling back into the same Haskell thread, there will be significant contention on that Haskell thread to complete all the callbacks.

In order to test this, I modified your code so that Haskell forks a new thread before calling create_threads, and create_threads only spawns one thread per call. If I'm correct, each OS thread will have a dedicated Haskell thread to perform work, so there should be much less contention. Although this still takes almost twice as long as the single-thread version, it's significantly faster than the original multi-threaded version, which lends some evidence to this theory. The difference is much less if I turn off thread migration with +RTS -qm.

As Daniel Fischer reports different results for ghc-7.2.2, I would expect that version changes how Haskell schedules callbacks. Maybe somebody on the ghc-users list can provide more information on this; I don't see anything likely in the release notes for 7.2.2 or 7.4.1.