I am experimenting with the parallel forms of Guile Scheme and I have the following code:
(use-modules (srfi srfi-1)
             (ice-9 pretty-print)
             (ice-9 receive))

(define (busy-work limit)
  (if (> limit 0)
      (begin (sqrt (+ (expt limit limit) 1))
             (busy-work (- limit 1)))
      'done))

(define (busy-work-2 lst)
  (cond [(null? lst) 'done]
        [else
         (expt (car lst) (car lst))
         (busy-work-2 (cdr lst))]))

(define (time thunk)
  (define starting-time (current-time))
  (define res (thunk))
  (define ending-time (current-time))
  (display "elapsed time: ")
  (display (- ending-time starting-time))
  (display "s")
  (newline)
  res)
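Since current-time only has whole-second resolution, a variant of this helper could use get-internal-real-time instead; something like the following sketch (the name time-precise is just an illustration, not the helper I used for the timings below):

;; Variant of the helper above with sub-second resolution, using
;; get-internal-real-time; internal-time-units-per-second converts
;; the tick difference into seconds.
(define (time-precise thunk)
  (define start (get-internal-real-time))
  (define res (thunk))
  (define end (get-internal-real-time))
  (display "elapsed time: ")
  (display (exact->inexact (/ (- end start)
                              internal-time-units-per-second)))
  (display "s")
  (newline)
  res)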
(define (partition-4 numbers)
  (define (loop numbers rc0 rc1 rc2 rc3)
    (cond [(null? numbers) (list (reverse rc0)
                                 (reverse rc1)
                                 (reverse rc2)
                                 (reverse rc3))]
          [else
           (let* ([number (car numbers)]
                  [residue (remainder number 4)])
             (cond [(= residue 0) (loop (cdr numbers)
                                        (cons number rc0)
                                        rc1
                                        rc2
                                        rc3)]
                   [(= residue 1) (loop (cdr numbers)
                                        rc0
                                        (cons number rc1)
                                        rc2
                                        rc3)]
                   [(= residue 2) (loop (cdr numbers)
                                        rc0
                                        rc1
                                        (cons number rc2)
                                        rc3)]
                   [(= residue 3) (loop (cdr numbers)
                                        rc0
                                        rc1
                                        rc2
                                        (cons number rc3))]))]))
  (loop numbers '() '() '() '()))
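For clarity, here is what partition-4 produces on a small input (just an illustration, not part of the timing runs):

;; partition-4 splits a list into its four residue classes modulo 4,
;; keeping the original order within each class.
(pretty-print (partition-4 (iota 12)))
;; prints: ((0 4 8) (1 5 9) (2 6 10) (3 7 11))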
(The code is also available in my experimentation repository at https://github.com/ZelphirKaltstahl/guile-scheme-tutorials/blob/5321470f8f3cbbdb7f64d4ed60e4b1eaf8d8f444/parallellism/utils.scm)
The two procedures busy-work and busy-work-2 do pure number crunching over lists of numbers, where, as far as I am aware, no calculation depends on the result of another. I know the time measurement might not be completely accurate.
However, I consistently do not get the expected speedup from using more threads (or cores, as I can see from the core usage in my CPU indicator).
Here are some examples. I would expect 2 threads to finish the task roughly twice as fast as 1 thread, and 4 threads roughly twice as fast as 2 threads, more or less at least, since I split up the lists in a way that should spread the work fairly evenly.
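For reference, the purely sequential version of the same work, mapping busy-work-2 over the residue classes on a single thread, would look like this:

;; Sequential baseline: the same residue classes, processed one after
;; another on a single thread.
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (map busy-work-2 residue-classes))))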
Using 4 cores and parallel
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (parallel (busy-work-2 (car residue-classes))
               (busy-work-2 (cadr residue-classes))
               (busy-work-2 (caddr residue-classes))
               (busy-work-2 (cadddr residue-classes))))))
This finishes in approximately 10s on my machine, sometimes 9s, sometimes 10s.
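As far as I understand, parallel is implemented on top of Guile's futures, so the call above should behave roughly like this explicit sketch using (ice-9 futures) (an assumption on my part, not something the timings depend on):

(use-modules (ice-9 futures))

;; Roughly the same work expressed with futures directly: start the
;; four computations, then wait for each result with touch.
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (let ([f0 (future (busy-work-2 (car residue-classes)))]
           [f1 (future (busy-work-2 (cadr residue-classes)))]
           [f2 (future (busy-work-2 (caddr residue-classes)))]
           [f3 (future (busy-work-2 (cadddr residue-classes)))])
       (list (touch f0) (touch f1) (touch f2) (touch f3))))))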
Using par-map, which uses 4 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (par-map busy-work-2
              residue-classes))))
This finishes in approximately 10s on my machine, sometimes 9s, sometimes 10s, just like with parallel.
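My understanding is that par-map draws its workers from a pool sized according to current-processor-count, which is easy to check directly (on my machine it reports 4):

;; Number of cores Guile reports, which is where the "4 threads"
;; above comes from on my machine.
(display (current-processor-count))
(newline)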
Using n-par-map with 4 threads (on my machine)
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (n-par-map (current-processor-count)
                busy-work-2
                residue-classes))))
Also 10s. Here the manual (https://www.gnu.org/software/guile/manual/html_node/Parallel-Forms.html) says:
Unlike those above, the functions described below take a number of threads as an argument. This makes them inherently non-portable since the specified number of threads may differ from the number of available CPU cores as returned by current-processor-count (see Processes). In addition, these functions create the specified number of threads when they are called and terminate them upon completion, which makes them quite expensive.
Therefore, they should be avoided.
While I find this explanation does not make 100% sense as it stands (why would n-par-map not use the same pre-created threads as parallel, if there are enough of them, such as the 4 on my machine?), I do not see any great overhead, and again it finishes in approximately 10s. My guess is that thread creation takes so little time that it is simply not noticeable next to all the computation of the number crunching.
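One rough way to test that guess is to give n-par-map only trivial work, so that whatever time it reports is dominated by thread setup and teardown rather than computation (a sketch):

;; Trivial workload: any measurable time here is mostly the cost of
;; creating and terminating the threads, not the mapping itself.
(time
 (lambda ()
   (n-par-map (current-processor-count)
              (lambda (x) x)
              (iota 4))))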
Using n-par-map with 2 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (n-par-map 2
                busy-work-2
                residue-classes))))
Expectation: it might finish in roughly 20s.
Result: it finishes in 12s.
Now of course I am thinking: "Well, there must be some massive overhead in the runs with 4 cores!"
Question: Where does this overhead come from, when I am doing pure number crunching with no interdependencies between results? Does Guile use some shared memory, so that memory access becomes a bottleneck?
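In case it helps to narrow this down, one thing I could try is comparing garbage-collector statistics before and after one of the 4-thread runs; gc-stats returns an alist of collector counters (the exact keys may depend on the Guile version), so a sketch would be:

;; Snapshot GC statistics around the parallel run; a large jump in
;; entries such as gc-time-taken would suggest that allocation and
;; collection are part of the overhead.
(define stats-before (gc-stats))

(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (par-map busy-work-2 residue-classes))))

(define stats-after (gc-stats))

(pretty-print stats-before)
(pretty-print stats-after)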