7 votes

I tested the Clojure functions map and pmap, as shown below, in the Clojure REPL. The result confused me: why is the parallel pmap slower than map?

user=> (def lg (range 1 10000000))
user=> (time (def rs (doall (pmap #(* % %) lg))))

"Elapsed time: 125739.056 msecs"

# -------------------------------------------------------
user=> (def lg (range 1 10000000))
user=> (time (def rs (doall (map #(* % %) lg))))

"Elapsed time: 5804.485 msecs"

**PS: the machine has 8 cores**

3 Answers

17 votes

With every parallel processing task, there is some amount of overhead due to task coordination. pmap applies the mapping function to each element individually in a different thread. As the lazy sequence returned by pmap is consumed, the consumer thread must coordinate with the producer threads. The way pmap is defined, this overhead occurs for each and every element produced.

Considering this, when you use pmap to compute a simple function (such as squaring a number, as in your example), the time it takes for the threads to coordinate their activities swamps the time it takes to actually compute the value. As the docstring says, pmap is "only useful for computationally intensive functions where the time of f dominates the coordination overhead" (emphasis added). In these cases, pmap will take longer than map regardless of how many cores you have.

To actually see a benefit from pmap, you must choose a "harder" problem. In some cases, this may be as simple as partitioning the input sequence into chunks. Then the sequence of chunks can be processed with pmap and then run through concat to get the final output.

For example:

(defn chunked-pmap [f partition-size coll]
  (->> coll                           ; Start with original collection.

       (partition-all partition-size) ; Partition it into chunks.

       (pmap (comp doall              ; Map f over each chunk,
                   (partial map f)))  ; and use doall to force it to be
                                      ; realized in the worker thread.

       (apply concat)))               ; Concatenate the chunked results
                                      ; to form the return value.

However, there is also overhead in partitioning the sequence and concatenating the chunks at the end. For example, at least on my machine, chunked-pmap still underperformed map by a significant amount for your example. Still, it may be effective for some functions.

Another way to improve the effectiveness of pmap is to partition the work at a different place in the overall algorithm. For example, suppose we were interested in calculating the Euclidean distance between pairs of points. While parallelizing the squaring function has proven ineffective, we might have some luck parallelizing the entire distance function. Realistically, we would want to partition the task at an even higher level still, but that is the gist of it.
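A sketch of that idea, using a hypothetical distance helper (the function names and point representation here are illustrative, not from the question):

```clojure
;; Hypothetical example: parallelize whole distance computations
;; rather than individual arithmetic operations, so each pmap task
;; carries more work relative to the coordination overhead.
(defn distance
  "Euclidean distance between two points given as coordinate seqs."
  [p q]
  (Math/sqrt (reduce + (map (fn [a b] (let [d (- a b)] (* d d))) p q))))

(defn pairwise-distances
  "Maps distance over a seq of [p q] point pairs in parallel."
  [pairs]
  (pmap (fn [[p q]] (distance p q)) pairs))
```

Whether this wins in practice still depends on how expensive each distance computation is compared to the per-element coordination cost.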

In short, the performance of a parallel algorithm is sensitive to the manner in which the task is partitioned, and you have chosen a level that is too granular for your test.

3 votes

Rörd is correct: there's a significant overhead in using pmap. Consider using reducers instead:

(require '[clojure.core.reducers :as r])

(def l (range 10000000))

(time (def a (doall (pmap #(* % %) l))))
"Elapsed time: 14674.415781 msecs"

(time (def a (doall (map #(* % %) l))))
"Elapsed time: 1119.107447 msecs"

(time (def a (doall (into [] (r/map #(* % %) l)))))
"Elapsed time: 1049.754652 msecs"
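Note that r/map by itself only composes the transformation; reducing it with `into` still runs serially. To actually exploit multiple cores with reducers, you reduce with r/fold (or r/foldcat) over a foldable source such as a vector. A minimal sketch, assuming a vector input:

```clojure
(require '[clojure.core.reducers :as r])

;; r/fold partitions a foldable collection (e.g. a vector) and
;; reduces the partitions in parallel; lazy seqs like (range n)
;; are not foldable, so we pour the input into a vector first.
(def v (vec (range 1000000)))

;; Parallel sum of squares:
(def sum-of-squares (r/fold + (r/map #(* % %) v)))

;; Parallel map that keeps the mapped results as a collection:
(def squares (r/foldcat (r/map #(* % %) v)))
```

The r/map timing above is close to plain map precisely because it avoids pmap's per-element coordination, not because it ran in parallel.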
2 votes

There is some overhead for creating the threads, splitting the workload between them, and reassembling the results. You will need a function that runs significantly longer than #(* % %) to see a speed improvement from pmap (and it does, of course, also depend on the number of cores in your CPU).
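For instance, here is a sketch with an artificially expensive function, where Thread/sleep stands in for real work (the 10 ms figure is arbitrary):

```clojure
;; Thread/sleep simulates an expensive per-element computation; with
;; this much work per element, pmap's coordination overhead becomes
;; negligible and the parallel version should win on a multi-core box.
(defn slow-square [n]
  (Thread/sleep 10)
  (* n n))

;; Compare at the REPL:
;; (time (doall (map slow-square (range 100))))   ; sleeps run back to back
;; (time (doall (pmap slow-square (range 100))))  ; sleeps overlap across threads
```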