I wrote a minimal example trying to parallelize some computation in Haskell. I basically tried to map some trig functions onto a long list by
- split the list in half
- evaluate the two halves in parallel
- concatenate them
- fold the list
My code:
-- in file par.hs
module Par where
import Control.Parallel
import Data.List
parmap f l = let
halflen = floor $ (realToFrac $ length l) / 2
half1 = take halflen l
half2 = drop halflen l
mapped1 = map f half1
mapped2 = map f half2
in mapped1 `par` (mapped2 `pseq` mapped1 ++ mapped2)
forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)
runMap :: Double
runMap = foldl' (+) 0.0 $ parmap forceEval [0.0..2000000.0]
main = do
putStrLn $ show runMap
I compiled it with ghc -prof -fprof-auto -rtsopts -threaded par.hs
I ran it with ./par +RTS -p -N?
where ?
is number of processors
I then looked at the generated par.prof
file.
Below are a couple of profiling outputs I got. I ran and profiled multiple times, so I'm pretty sure these numbers were not outliers.
Running with -N1
gave me:
Tue May 7 23:20 2019 Time and Allocation Profiling Report (Final)
par +RTS -N1 -p -RTS
total time = 1.20 secs (1200 ticks @ 1000 us, 1 processor)
total alloc = 1,936,132,144 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
forceEval Main par.hs:28:1-57 75.8 60.3
runMap Main par.hs:31:1-59 17.5 19.0
parmap' Main par.hs:(6,1)-(12,56) 3.2 14.9
parmap'.half1 Main par.hs:8:5-26 2.3 5.8
parmap'.half2 Main par.hs:9:5-26 1.1 0.0
Running with -N2
(note how the profile indicates a more than a 2x speedup?):
Tue May 7 23:24 2019 Time and Allocation Profiling Report (Final)
par +RTS -N2 -p -RTS
total time = 0.36 secs (725 ticks @ 1000 us, 2 processors)
total alloc = 1,936,149,368 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
forceEval Main par.hs:28:1-57 70.6 60.3
runMap Main par.hs:31:1-59 19.3 19.0
parmap' Main par.hs:(6,1)-(12,56) 4.3 14.9
parmap'.half1 Main par.hs:8:5-26 3.3 5.8
parmap'.half2 Main par.hs:9:5-26 1.7 0.0
Running with -N4
(note the slight more speedup compared to -N2
):
Tue May 7 23:25 2019 Time and Allocation Profiling Report (Final)
par +RTS -N4 -p -RTS
total time = 0.27 secs (1098 ticks @ 1000 us, 4 processors)
total alloc = 1,936,183,704 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
forceEval Main par.hs:28:1-57 71.5 60.3
runMap Main par.hs:31:1-59 19.3 19.0
parmap' Main par.hs:(6,1)-(12,56) 3.8 14.9
parmap'.half1 Main par.hs:8:5-26 3.6 5.8
parmap'.half2 Main par.hs:9:5-26 1.2 0.0
I expected to see some speedup when running on two processors, but no additional speedup if I run on more processors. But I couldn't visually observe any speedup at all, yet the GHC profiles above told me that there was a speed up - an unrealistically good one, plus additional speedup if I use more than two processors?
In fact in another program where I attempted to parallelize computation, I was compiling and profiling using stack, and observed such similar false speedup as well.
I'd really appreciate if someone could explain to me what went on: Is this the right way of writing parallel Haskell code? Why could running with 4 cores be helpful in runtime if I only had two parts to evaluate in parallel? Or did I just misinterpret the profiling output?
Thanks in advance for any help.