2
votes

I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.

The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which given a number with k bits set, returns the next number with the same number of bits set:

import Data.Bits    

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0      = 0
  | otherwise   = ripple .|. ones
                       where smallest     = n .&. (-n)
                             ripple       = n + smallest
                             newSmallest  = ripple .&. (-ripple)
                             ones         = (newSmallest `div` smallest) `shiftR` 1 - 1

It is now easy to obtain sequentially all k-combinations in the range [(2^k - 1), (2^(n-k)+...+ 2^(n-1)):

import qualified Data.Stream as ST

combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
  where start = 2^k - 1
        end   = sum $ fmap (2^) [n-k..n-1]

kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber

main :: IO ()
main = do
  params <- getArgs
  let n = read $ params !! 0
      k = read $ params !! 1
  print $ length (combs n k)

My idea is that this should be easily parallelizable splitting this range into smaller parts. For example:

start :: Int -> Integer
start k = 2 ^ k - 1

end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]

splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
  where s                       = start k
        e                       = end n k
        step                    = (e-s) `div` (min (e-s) (toInteger numSplits))
        initSplits              = [s,s+step..e]
        ranges                  = zip initSplits (tail initSplits)
        fixedRanges [] acc      = acc
        fixedRanges [x] acc     = acc ++ [(fst x, e, k)]
        fixedRanges (x:xs) acc  = fixedRanges xs (acc ++ [(fst x, snd x, k)])

At this point, I would like to run each split in parallel, something like:

runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
  where fixStart s
                   | popCount s == k = s
                   | otherwise       = fixStart $ s + 1

For pallalelization I am using the monad-par package:

import Control.Monad.Par
import System.Environment
import qualified Data.Set as S

main :: IO ()
main = do
  params <- getArgs
  let n               = read $ params !! 0
      k               = read $ params !! 1
      numTasks        = read $ params !! 2
      batches         = runPar $ parMap runSplit (splits n k numTasks)
      reducedNumbers  = foldl S.union S.empty $ fmap S.fromList batches
  print $ S.size reducedNumbers

The result is that the sequential version is way faster and it uses little memory, while the parallel version consumes a lot of memory and it is noticeable slower.

What might be the reasons causing this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory or are simply meant to produce simple results (i.e. only cpu-intensive computations)?

I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.

2
How are you running and testing the program?pdexter
Did you forget a parameter k on the combs function?rprospero
@pdexter have just updated my question with build/run commands and profiling reports for both parallel and sequential versionsjarandaf
What's your main for the sequential version? Thanks.ErikR
@d8d0d65b3f7cf42 tried to use threadscope with stack without much success (-eventlog flag seems not supported)jarandaf

2 Answers

2
votes

In your sequential version calling combs does not build up a list in memory since after length consumes an element it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.

For instance, this will take a while but won't consume a lot of memory:

main = print $ length [1..1000000000]  -- 1 billion

In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.

A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory and would operate more like the sequential version.

Update

Here is an example of how to use parMap. Note: under 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)

{-
compile with: ghc -O2 -threaded -rtsopts foo.hs

compare:

  time ./foo 26 +RTS -N1
  time ./foo 26 +RTS -N2
-}

import Data.Bits    
import Control.Parallel.Strategies
import System.Environment

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0      = 0
  | otherwise   = ripple .|. ones
                       where smallest     = n .&. (-n)
                             ripple       = n + smallest
                             newSmallest  = ripple .&. (-ripple)
                             ones         = (newSmallest `div` smallest) `shiftR` 1 - 1

combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
  where start = 2^k - 1
        end   = shift start (n-k)

main :: IO ()
main = do
  ( arg1 : _) <- getArgs
  let n = read arg1
  print $ parMap rseq (length . combs n) [1..n]
0
votes

good approaches for this problem

What do you mean by this problem? If it's how to write, analyze and tune a parallel Haskell program, then this is required background reading:

Simon Marlow: Parallel and Concurrent Programming in Haskell http://community.haskell.org/~simonmar/pcph/

in particular, Section 15 (Debugging, Tuning, ..)

Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope