I am experimenting with the parallel forms of Guile Scheme and I have the following code:
(use-modules (srfi srfi-1)
             (ice-9 pretty-print)
             (ice-9 receive))

(define (busy-work limit)
  (if (> limit 0)
      (begin (sqrt (+ (expt limit limit) 1))
             (busy-work (- limit 1)))
      'done))

(define (busy-work-2 lst)
  (cond [(null? lst) 'done]
        [else
         (expt (car lst) (car lst))
         (busy-work-2 (cdr lst))]))

(define (time thunk)
  (define starting-time (current-time))
  (define res (thunk))
  (define ending-time (current-time))
  (display "elapsed time: ")
  (display (- ending-time starting-time))
  (display "s")
  (newline)
  res)
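Since current-time only has whole-second resolution, a variant of this helper could use get-internal-real-time instead; something like the following sketch (the name time-precise is just an illustration, not the helper I used for the timings below):

;; Variant of the helper above with sub-second resolution, using
;; get-internal-real-time; internal-time-units-per-second converts
;; the tick difference into seconds.
(define (time-precise thunk)
  (define start (get-internal-real-time))
  (define res (thunk))
  (define end (get-internal-real-time))
  (display "elapsed time: ")
  (display (exact->inexact (/ (- end start)
                              internal-time-units-per-second)))
  (display "s")
  (newline)
  res)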
(define (partition-4 numbers)
  (define (loop numbers rc0 rc1 rc2 rc3)
    (cond [(null? numbers) (list (reverse rc0)
                                 (reverse rc1)
                                 (reverse rc2)
                                 (reverse rc3))]
          [else
           (let* ([number (car numbers)]
                  [residue (remainder number 4)])
             (cond [(= residue 0) (loop (cdr numbers)
                                        (cons number rc0)
                                        rc1
                                        rc2
                                        rc3)]
                   [(= residue 1) (loop (cdr numbers)
                                        rc0
                                        (cons number rc1)
                                        rc2
                                        rc3)]
                   [(= residue 2) (loop (cdr numbers)
                                        rc0
                                        rc1
                                        (cons number rc2)
                                        rc3)]
                   [(= residue 3) (loop (cdr numbers)
                                        rc0
                                        rc1
                                        rc2
                                        (cons number rc3))]))]))
  (loop numbers '() '() '() '()))
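For clarity, here is what partition-4 produces on a small input (just an illustration, not part of the timing runs):

;; partition-4 splits a list into its four residue classes modulo 4,
;; keeping the original order within each class.
(pretty-print (partition-4 (iota 12)))
;; prints: ((0 4 8) (1 5 9) (2 6 10) (3 7 11))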
(The code is also available in my experimentation repository at https://github.com/ZelphirKaltstahl/guile-scheme-tutorials/blob/5321470f8f3cbbdb7f64d4ed60e4b1eaf8d8f444/parallellism/utils.scm)
The two procedures busy-work and busy-work-2 do pure number crunching over lists of numbers, where, as far as I am aware, no calculation depends on the result of another. I know the time measurement might not be completely accurate.
However, I consistently do not get the expected speedup from using more threads (or cores, as I can see from the core usage in my CPU indicator).
Here are some examples. I would expect 2 threads to finish the task roughly twice as fast as 1 thread, and 4 threads roughly twice as fast as 2 threads, more or less at least, since I split up the lists in a way that should spread the work fairly evenly.
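For reference, the purely sequential version of the same work, mapping busy-work-2 over the residue classes on a single thread, would look like this:

;; Sequential baseline: the same residue classes, processed one after
;; another on a single thread.
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (map busy-work-2 residue-classes))))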
Using 4 cores and parallel
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (parallel (busy-work-2 (car residue-classes))
               (busy-work-2 (cadr residue-classes))
               (busy-work-2 (caddr residue-classes))
               (busy-work-2 (cadddr residue-classes))))))
This finishes in approximately 10s on my machine, sometimes 9s, sometimes 10s.
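As far as I understand, parallel is implemented on top of Guile's futures, so the call above should behave roughly like this explicit sketch using (ice-9 futures) (an assumption on my part, not something the timings depend on):

(use-modules (ice-9 futures))

;; Roughly the same work expressed with futures directly: start the
;; four computations, then wait for each result with touch.
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (let ([f0 (future (busy-work-2 (car residue-classes)))]
           [f1 (future (busy-work-2 (cadr residue-classes)))]
           [f2 (future (busy-work-2 (caddr residue-classes)))]
           [f3 (future (busy-work-2 (cadddr residue-classes)))])
       (list (touch f0) (touch f1) (touch f2) (touch f3))))))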
Using par-map, which uses 4 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (par-map busy-work-2
              residue-classes))))
This finishes in approximately 10s on my machine, sometimes 9s, sometimes 10s, just like with parallel.
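My understanding is that par-map draws its workers from a pool sized according to current-processor-count, which is easy to check directly (on my machine it reports 4):

;; Number of cores Guile reports, which is where the "4 threads"
;; above comes from on my machine.
(display (current-processor-count))
(newline)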
Using n-par-map with 4 threads (on my machine)
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (n-par-map (current-processor-count)
                busy-work-2
                residue-classes))))
Also 10s. Here the manual (https://www.gnu.org/software/guile/manual/html_node/Parallel-Forms.html) says:
Unlike those above, the functions described below take a number of threads as an argument. This makes them inherently non-portable since the specified number of threads may differ from the number of available CPU cores as returned by current-processor-count (see Processes). In addition, these functions create the specified number of threads when they are called and terminate them upon completion, which makes them quite expensive.
Therefore, they should be avoided.
While I find this explanation does not make 100% sense as it stands (why would n-par-map not use the same pre-created threads as parallel, if there are enough of them, such as the 4 on my machine?), I do not see any great overhead, and again it finishes in approximately 10s. My guess is that thread creation takes so little time that it is simply not noticeable next to all the computation of the number crunching.
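One rough way to test that guess is to give n-par-map only trivial work, so that whatever time it reports is dominated by thread setup and teardown rather than computation (a sketch):

;; Trivial workload: any measurable time here is mostly the cost of
;; creating and terminating the threads, not the mapping itself.
(time
 (lambda ()
   (n-par-map (current-processor-count)
              (lambda (x) x)
              (iota 4))))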
Using n-par-map with 2 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (n-par-map 2
                busy-work-2
                residue-classes))))
Expectation: it might finish in roughly 20s.
Result: it finishes in 12s.
Now of course I am thinking: "Well, there must be some massive overhead in the runs with 4 cores!"
Question: Where does this overhead come from, when I am doing pure number crunching with no interdependencies between results? Does Guile use some shared memory, so that memory access becomes a bottleneck?
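In case it helps to narrow this down, one thing I could try is comparing garbage-collector statistics before and after one of the 4-thread runs; gc-stats returns an alist of collector counters (the exact keys may depend on the Guile version), so a sketch would be:

;; Snapshot GC statistics around the parallel run; a large jump in
;; entries such as gc-time-taken would suggest that allocation and
;; collection are part of the overhead.
(define stats-before (gc-stats))

(let ([residue-classes (partition-4 (iota 30000))])
  (time
   (lambda ()
     (par-map busy-work-2 residue-classes))))

(define stats-after (gc-stats))

(pretty-print stats-before)
(pretty-print stats-after)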