4
votes

I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.

The code is as follows:

var workerWG sync.WaitGroup

func worker(fibNum chan int) {
    for {
        var tgt = <-fibNum
        workerWG.Add(1)
        var a, b float64 = 0, 1
        for i := 0; i < tgt; i++ {
            a, b = a+b, a
        }
        workerWG.Done()
    }
}

func main() {
    rand.Seed(time.Now().UnixNano())
    runtime.GOMAXPROCS(1) // LINE IN QUESTION

    var fibNum = make(chan int)

    for i := 0; i < 4; i++ {
        go worker(fibNum)
    }
    for i := 0; i < 500000; i++ {
        fibNum <- rand.Intn(1000)
    }
    workerWG.Wait()
}

If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.

What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?

My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:

When I replace the worker function with the following code:

for {
    <-fibNum
    time.Sleep(500 * time.Millisecond)
}

both one available thread and four available threads take the same amount of time.

3
Most likely there is some overhead for using a worker thread, and the job done by each worker thread is so trivial that the overhead takes more time than the actual job. There may also be overhead for synchronizing between threads, and with one thread only this synchronisation doesn't happen at all.gnasher729
I don't know anything about Go, but it sounds like a synchronization issue, that your worker threads don't really work in parallel but in serial.Some programmer dude
@JoachimPileborg If that were the case, shouldn't it take the same amount of time no matter how many available threads were allocated?user1131435
you need remove the waitGroup, or goroutine schedule and synchronization problem is mixed, making the testing too complex.Jiang YD
Your way of using the wait group is weird. Here's how it's normally done.thwd

3 Answers

4
votes

I revised your program to look like the following:

package main

import (
    "math/rand"
    "runtime"
    "sync"
    "time"
)

var workerWG sync.WaitGroup

func worker(fibNum chan int) {
    for tgt := range fibNum {
        var a, b float64 = 0, 1
        for i := 0; i < tgt; i++ {
            a, b = a+b, a
        }
    }
    workerWG.Done()
}

func main() {
    rand.Seed(time.Now().UnixNano())
    runtime.GOMAXPROCS(1) // LINE IN QUESTION

    var fibNum = make(chan int)

    for i := 0; i < 4; i++ {
        go worker(fibNum)
        workerWG.Add(1)
    }
    for i := 0; i < 500000; i++ {
        fibNum <- rand.Intn(100000)
    }
    close(fibNum)
    workerWG.Wait()
}
  • I cleaned up the wait group usage.
  • I changed rand.Intn(1000) to rand.Intn(100000)

On my machine that produces:

$ time go run threading.go (GOMAXPROCS=1)

real    0m20.934s
user    0m20.932s
sys 0m0.012s

$ time go run threading.go (GOMAXPROCS=8)

real    0m10.634s
user    0m44.184s
sys 0m1.928s

This means that in your original code, the work performed vs synchronization (channel read/write) was negligible. The slowdown came from having to synchronize across threads instead of one and only perform a very small amount of work inbetween.

In essence, synchronization is expensive compared to calculating fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.

Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:

$ time go run threading.go 

real    0m4.971s
user    0m35.692s
sys 0m0.044s
1
votes

The code written by @thwd is correct and idiomatic Go.

Your code was being serialized due to the atomic nature of sync.WaitGroup. Both workerWG.Add(1) and workerWG.Done() will block until they're able to atomically update the internal counter.

  • Since the workload is between 0 and 1000 recursive calls, the bottleneck of a single core was enough to keep data races on the waitgroup counter to a minimum.
  • On multiple cores, the processor spends a lot of time spinning to fix the collisions of waitgroup calls. Add that to the fact that the waitgroup counter is kept on one core and you now have added communication between cores (taking up even more cycles).

A couple hints for simplifying code:

  • For a small, set number of goroutines, a complete channel (chan struct{} to avoid allocations) is cheaper to use.
  • Use the send channel close as a kill signal for goroutines and have them signal that they've exited (waitgroup or channel). Then, close to complete channel to free them up for the GC.
  • If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls forces added synchronization.
-1
votes

Your main computation routine in worker does not allow the scheduler to run. Calling the scheduler manually like

    for i := 0; i < tgt; i++ {
        a, b = a+b, a
        if i%300 == 0 {
            runtime.Gosched()
        }
    }

Reduces wall clock by 30% when switching from one to two threads.

Such artificial microbenchmarks are really hard to get right.