I wrote a minimal example trying to parallelize some computation in Haskell. Basically, I tried to map some trig functions over a long list by:

  1. split the list in half
  2. evaluate the two halves in parallel
  3. concatenate them
  4. fold the list

My code:

-- in file par.hs

module Main where

import Control.Parallel
import Data.List

parmap f l = let
    halflen = floor $ (realToFrac $ length l) / 2
    half1 = take halflen l
    half2 = drop halflen l
    mapped1 = map f half1
    mapped2 = map f half2
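    -- intent: spark mapped1 for parallel evaluation while mapped2 is evaluated here, then concatenate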
    in mapped1 `par` (mapped2 `pseq` mapped1 ++ mapped2)

forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)

runMap :: Double
runMap = foldl' (+) 0.0 $ parmap forceEval [0.0..2000000.0]

main = do
    putStrLn $ show runMap

I compiled it with ghc -prof -fprof-auto -rtsopts -threaded par.hs

I ran it with ./par +RTS -p -N? where ? is the number of processors.

I then looked at the generated par.prof file.


Below are a couple of profiling outputs I got. I ran and profiled multiple times, so I'm pretty sure these numbers were not outliers.

Running with -N1 gave me:

    Tue May  7 23:20 2019 Time and Allocation Profiling Report  (Final)

       par +RTS -N1 -p -RTS

    total time  =        1.20 secs   (1200 ticks @ 1000 us, 1 processor)
    total alloc = 1,936,132,144 bytes  (excludes profiling overheads)

    COST CENTRE   MODULE    SRC                   %time %alloc

    forceEval     Main      par.hs:28:1-57         75.8   60.3
    runMap        Main      par.hs:31:1-59         17.5   19.0
    parmap'       Main      par.hs:(6,1)-(12,56)    3.2   14.9
    parmap'.half1 Main      par.hs:8:5-26           2.3    5.8
    parmap'.half2 Main      par.hs:9:5-26           1.1    0.0

Running with -N2 (note how the profile indicates more than a 2x speedup):

    Tue May  7 23:24 2019 Time and Allocation Profiling Report  (Final)

       par +RTS -N2 -p -RTS

    total time  =        0.36 secs   (725 ticks @ 1000 us, 2 processors)
    total alloc = 1,936,149,368 bytes  (excludes profiling overheads)

    COST CENTRE   MODULE    SRC                   %time %alloc

    forceEval     Main      par.hs:28:1-57         70.6   60.3
    runMap        Main      par.hs:31:1-59         19.3   19.0
    parmap'       Main      par.hs:(6,1)-(12,56)    4.3   14.9
    parmap'.half1 Main      par.hs:8:5-26           3.3    5.8
    parmap'.half2 Main      par.hs:9:5-26           1.7    0.0

Running with -N4 (note the slight additional speedup compared to -N2):

    Tue May  7 23:25 2019 Time and Allocation Profiling Report  (Final)

       par +RTS -N4 -p -RTS

    total time  =        0.27 secs   (1098 ticks @ 1000 us, 4 processors)
    total alloc = 1,936,183,704 bytes  (excludes profiling overheads)

    COST CENTRE   MODULE    SRC                   %time %alloc

    forceEval     Main      par.hs:28:1-57         71.5   60.3
    runMap        Main      par.hs:31:1-59         19.3   19.0
    parmap'       Main      par.hs:(6,1)-(12,56)    3.8   14.9
    parmap'.half1 Main      par.hs:8:5-26           3.6    5.8
    parmap'.half2 Main      par.hs:9:5-26           1.2    0.0

I expected to see some speedup when running on two processors, and no additional speedup from more than two. In practice I couldn't observe any speedup at all in how long the program actually took to run, yet the GHC profiles above told me there was one, an unrealistically good one at that, plus further gains beyond two processors.

In fact, in another program where I attempted to parallelize some computation, this time compiling and profiling with stack, I observed a similarly suspicious speedup.

I'd really appreciate it if someone could explain what went on: is this the right way to write parallel Haskell code? Why would running on 4 cores improve the runtime at all when there are only two parts to evaluate in parallel? Or did I just misinterpret the profiling output?

Thanks in advance for any help.

How are you determining that there isn't any speedup? – Rein Henrichs

1 Answer


When par sparks its first argument, it only evaluates it to weak head normal form, that is, to the outermost constructor. For a list such as mapped1 this means just the first (:) cell; everything inside it can still refer to unevaluated lazy computation, so the expensive work is not actually done in parallel.
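
To make that concrete, here is a minimal sketch of the problem (not from the question, just illustrative, with the same shape as parmap): the spark only reaches the first cons cell of xs, while its elements and tail stay unevaluated, so the main thread still ends up doing essentially all of the trigonometric work.

import Control.Parallel (par, pseq)

sketch :: Double
sketch =
  let xs = map sin [0.0 .. 1000000.0]   -- lazy list: every element is a thunk
      ys = map cos [0.0 .. 1000000.0]
  -- the spark evaluates xs only to weak head normal form (its first cons cell)
  in xs `par` (ys `pseq` sum (xs ++ ys))

main :: IO ()
main = print sketch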

If you change it to

import Control.Parallel
import Data.List

parmap f a l = let
    halflen = floor $ (realToFrac $ length l) / 2
    half1 = take halflen l
    half2 = drop halflen l
    mapped1 = map f half1
    mapped2 = map f half2
    agg1 = a mapped1
    agg2 = a mapped2
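    -- spark agg1 for parallel evaluation, force agg2 here, then combine the
    -- two partial results with the same aggregator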
    in agg1 `par` agg2 `pseq` a [agg1, agg2] 

forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)                   

runMap :: Double
runMap = parmap forceEval (foldl' (+) 0.0) [0.0..20000000.0]

main = do
    putStrLn $ show runMap

You will see a speedup when switching from one thread

time ./f +RTS -N1
-4615093.834471449

real    0m13.077s
user    0m12.333s
sys 0m0.744s

to two threads

time ./f +RTS -N2
-4615093.834471449

real    0m9.057s
user    0m14.512s
sys 0m2.170s

In my example agg1 and agg2 are plain Double values, so forcing them to weak head normal form forces the whole underlying computation.

Alternative

You can use rnf from Control.DeepSeq to force the whole list to be computed.

import Control.Parallel 
import Control.DeepSeq
import Data.List

parmap f l = let
    halflen = floor $ (realToFrac $ length l) / 2
    half1 = take halflen l
    half2 = drop halflen l
    mapped1 = map f half1
    mapped2 = map f half2
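    -- rnf reduces each half to normal form (every element evaluated), so the
    -- sparked work is the real work, not just the first cons cell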
    in (rnf mapped1) `par` (rnf mapped2) `pseq` (mapped1 ++ mapped2)

forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)

runMap :: Double
runMap = foldl' (+) 0.0 $ parmap forceEval [0.0..20000000.0]

main = do
    putStrLn $ show runMap

Again, you will see a performance improvement.

One thread

time ./f2 +RTS -N1
-4615093.83447202

real    0m15.241s
user    0m14.261s
sys 0m0.980s

Two threads

time ./f2 +RTS -N2
-4615093.83447202

real    0m11.640s
user    0m17.092s
sys 0m3.178s
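
As a side note that goes beyond the original answer: the same divide-and-force pattern is what Control.Parallel.Strategies (from the same parallel package) packages up for you. Here is a sketch under those assumptions, with an arbitrarily chosen chunk size of 100000:

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)
import Data.List (foldl')

forceEval :: Double -> Double
forceEval x = sin x + cos x - tan x + sin (3*x) - cos (2*x)

runMap :: Double
runMap = foldl' (+) 0.0 mapped
  where
    -- evaluate the mapped list in parallel chunks, forcing each chunk to
    -- normal form so every spark does real work
    mapped = map forceEval [0.0 .. 20000000.0] `using` parListChunk 100000 rdeepseq

main :: IO ()
main = print runMap

The chunking simply gives the scheduler more, smaller sparks than the two-way split above; whether it actually beats the two-way split depends on the machine and the workload.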