With every parallel processing task, there is some amount of overhead due to task coordination. pmap applies the mapping function to each element individually, in a different thread. As the lazy sequence returned by pmap is consumed, the consumer thread must coordinate with the producer threads. The way pmap is defined, this overhead occurs for each and every element produced.
Considering this, when you use pmap to compute a simple function (such as squaring a number, as in your example), the time it takes for the threads to coordinate their activities swamps the time it takes to actually compute the value. As the docstring says, pmap is "only useful for computationally intensive functions where the time of f dominates the coordination overhead" (emphasis added). In those cases, pmap will take longer than map regardless of how many cores you have.
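For instance, a rough comparison at the REPL might look like the following (exact timings will of course vary by machine; both results are wrapped in doall so the lazy sequences are fully realized while being timed):

(let [xs (doall (range 100000))]
  (time (doall (map  #(* % %) xs)))   ; cheap per-element work: map usually wins
  (time (doall (pmap #(* % %) xs))))  ; per-element coordination cost dominates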
To actually see a benefit from pmap, you must choose a "harder" problem. In some cases, this may be as simple as partitioning the input sequence into chunks. Then the sequence of chunks can be processed with pmap and run through concat to get the final output.
For example:
(defn chunked-pmap [f partition-size coll]
  (->> coll                            ; Start with original collection.
       (partition-all partition-size)  ; Partition it into chunks.
       (pmap (comp doall               ; Map f over each chunk,
                   (partial map f)))   ; and use doall to force it to be
                                       ; realized in the worker thread.
       (apply concat)))                ; Concatenate the chunked results
                                       ; to form the return value.
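It can be used like this (the chunk size of 1000 is just an illustrative guess; a good value depends on how expensive f is):

(chunked-pmap #(* % %) 1000 (range 100000))
;; => same result as (map #(* % %) (range 100000)), but each chunk of
;;    1000 elements is realized on its own worker thread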
However, there is also an overhead for partitioning the sequence and concatenating the chunks at the end. For example, at least on my machine, chunked-pmap still underperformed map by a significant amount for your example. Still, it may be effective for some functions.
Another way to improve the effectiveness of pmap is to partition the work at a different place in the overall algorithm. For example, suppose we were interested in calculating the Euclidean distance between pairs of points. While parallelizing the square function has proven to be ineffective, we might have some luck parallelizing the entire distance function. Realistically, we would want to partition the task at an even higher level, but that is the gist of it.
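A minimal sketch of that idea (the point representation and the euclidean-distance helper are illustrative assumptions, not something from your code):

(defn euclidean-distance [p q]    ; p and q are equal-length sequences of numbers
  (Math/sqrt (reduce + (map (fn [a b]
                              (let [d (- a b)]
                                (* d d)))
                            p q))))

(defn pairwise-distances [point-pairs]
  ;; Parallelize over whole distance computations rather than
  ;; over the individual squarings inside each one.
  (pmap (fn [[p q]] (euclidean-distance p q)) point-pairs))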
In short, the performance of parallel algorithms is sensitive to the manner in which the task is partitioned, and you have chosen a level that is too granular for your test.