I'm trying to understand how expensive a (green) thread in Haskell (GHC 7.10.1 on OS X 10.10.5) really is. I'm aware that it's super cheap compared to a real OS thread, both in memory usage and CPU.
Right, so I started by writing a super simple program that forks n
(green) threads (using the excellent async
library) and then just sleeps for m
seconds in each of them.
Well, that's easy enough:
$ cat PerTheadMem.hs
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (mapConcurrently)
import System.Environment (getArgs)
main = do
  args <- getArgs
  let (numThreads, sleep) = case args of
        numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
        _ -> error "wrong args"
  mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]
and first of all, let's compile and run it:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
that should fork 100k threads and wait 10s in each and then print us some information:
$ time ./PerTheadMem 100000 10 +RTS -sstderr
340,942,368 bytes allocated in the heap
880,767,000 bytes copied during GC
164,702,328 bytes maximum residency (11 sample(s))
21,736,080 bytes maximum slop
350 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 648 colls, 0 par 0.373s 0.415s 0.0006s 0.0223s
Gen 1 11 colls, 0 par 0.298s 0.431s 0.0392s 0.1535s
INIT time 0.000s ( 0.000s elapsed)
MUT time 79.062s ( 92.803s elapsed)
GC time 0.670s ( 0.846s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.065s ( 0.091s elapsed)
Total time 79.798s ( 93.740s elapsed)
%GC time 0.8% (0.9% elapsed)
Alloc rate 4,312,344 bytes per MUT second
Productivity 99.2% of total user, 84.4% of total elapsed
real 1m33.757s
user 1m19.799s
sys 0m2.260s
It took quite long (1m33.757s) given that each thread is supposed to only just wait for 10s, but we've built it non-threaded, so fair enough for now. All in all, we used 350 MB; that's not too bad, that's 3.5 KB per thread, given that the initial stack size (-ki
) is 1 KB.
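To see how much of that 3.5 KB is thread stack, one quick experiment (not from the original measurements above, just a sketch) would be to vary the initial stack size via the RTS's -ki flag, which the -rtsopts build already accepts, and watch how "total memory in use" in the -sstderr output moves:

```shell
# Hypothetical experiment: halve the initial per-thread stack (default is 1k)
$ ./PerTheadMem 100000 10 +RTS -sstderr -ki512
# ...and try a larger one for comparison:
$ ./PerTheadMem 100000 10 +RTS -sstderr -ki4k
```

If total memory scales with -ki, the per-thread cost is dominated by stack; if not, the overhead lives elsewhere.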
Right, but now let's compile it in threaded mode and see if we can get any faster:
$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
3,996,165,664 bytes allocated in the heap
2,294,502,968 bytes copied during GC
3,443,038,400 bytes maximum residency (20 sample(s))
14,842,600 bytes maximum slop
3657 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6435 colls, 0 par 0.860s 1.022s 0.0002s 0.0028s
Gen 1 20 colls, 0 par 2.206s 2.740s 0.1370s 0.3874s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.879s ( 8.534s elapsed)
GC time 3.066s ( 3.762s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.074s ( 0.247s elapsed)
Total time 4.021s ( 12.545s elapsed)
Alloc rate 4,544,893,364 bytes per MUT second
Productivity 23.7% of total user, 7.6% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 0m12.565s
user 0m4.021s
sys 0m1.154s
Wow, much faster, just 12s now, way better. From Activity Monitor I saw that it used roughly 4 OS threads for the 100k green threads, which makes sense.
However, 3657 MB total memory! That's 10x more than the non-threaded version used...
Up until now, I didn't do any profiling using -prof
or -hy
or the like. To investigate a bit more I then did some heap profiling (-hy
) in separate runs. The memory usage didn't change in either case, and the heap profiling graphs look interestingly different (left: non-threaded, right: threaded), but I can't find the reason for the 10x difference.
Diffing the profiling output (.prof
files), I can't find any real difference either.
Therefore my question: Where does the 10x difference in memory usage come from?
EDIT: Just to mention it: the same difference applies even when the program isn't compiled with profiling support. Running time ./PerTheadMem 100000 10 +RTS -sstderr
built with ghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs
uses 3559 MB, and built with ghc -rtsopts -fforce-recomp PerTheadMem.hs
it's 395 MB.
EDIT 2: On Linux (GHC 7.10.2
on Linux 3.13.0-32-generic #57-Ubuntu SMP, x86_64
) the same happens: non-threaded uses 460 MB in 1m28.538s and threaded uses 3483 MB in 12.604s. /usr/bin/time -v
reports Maximum resident set size (kbytes): 413684
and Maximum resident set size (kbytes): 1645384
respectively.
EDIT 3: I also changed the program to use forkIO
directly:
import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (mapM_)
import System.Environment (getArgs)
main = do
  args <- getArgs
  let (numThreads, sleep) = case args of
        numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
        _ -> error "wrong args"
  mvar <- newEmptyMVar
  mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar ())
        [1..numThreads]
  mapM_ (\_ -> takeMVar mvar) [1..numThreads]
And it doesn't change anything: non-threaded: 152 MB, threaded: 3308 MB.
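To compare the two builds without relying on time or -sstderr, one could also read the RTS's own statistics from inside the program. Below is a hedged sketch against GHC 7.10's GHC.Stats API (getGCStats / getGCStatsEnabled; later GHC versions replaced this with getRTSStats); it must be run with +RTS -T so the RTS collects statistics:

```haskell
import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (replicateM_)
import GHC.Stats (GCStats(..), getGCStats, getGCStatsEnabled)

-- Sketch for GHC 7.10: run with e.g. ./Prog +RTS -T (add -N for threaded).
main :: IO ()
main = do
  let numThreads = 100000
      sleepSecs  = 10
  mvar <- newEmptyMVar
  replicateM_ numThreads $
    forkIO $ threadDelay (sleepSecs * 1000 * 1000) >> putMVar mvar ()
  replicateM_ numThreads (takeMVar mvar)
  enabled <- getGCStatsEnabled
  if enabled
    then do
      stats <- getGCStats
      putStrLn $ "maxBytesUsed:           " ++ show (maxBytesUsed stats)
      putStrLn $ "peakMegabytesAllocated: " ++ show (peakMegabytesAllocated stats)
    else putStrLn "re-run with +RTS -T to enable RTS statistics"
```

Printing maxBytesUsed and peakMegabytesAllocated from identical threaded and non-threaded binaries would make the 10x gap visible without any external tooling.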