Why does this program run faster when it's allocated fewer threads?

Question

I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.

The code is as follows:

var workerWG sync.WaitGroup

func worker(fibNum chan int) {
    for {
        var tgt = <-fibNum
        workerWG.Add(1)
        var a, b float64 = 0, 1
        for i := 0; i < tgt; i++ {
            a, b = a+b, a
        }
        workerWG.Done()
    }
}

func main() {
    rand.Seed(time.Now().UnixNano())
    runtime.GOMAXPROCS(1) // LINE IN QUESTION

    var fibNum = make(chan int)

    for i := 0; i < 4; i++ {
        go worker(fibNum)
    }
    for i := 0; i < 500000; i++ {
        fibNum <- rand.Intn(1000)
    }
    workerWG.Wait()
}

If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.

What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?

My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:

When I replace the worker function with the following code:

for {
    <-fibNum
    time.Sleep(500 * time.Millisecond)
}

both one available thread and four available threads take the same amount of time.

Most likely there is some overhead for using a worker thread, and the job done by each worker thread is so trivial that the overhead takes more time than the actual job. There may also be overhead for synchronizing between threads, and with one thread only this synchronisation doesn't happen at all. — gnasher729
I don't know anything about Go, but it sounds like a synchronization issue, that your worker threads don't really work in parallel but in serial. — Some programmer dude
@JoachimPileborg If that were the case, shouldn't it take the same amount of time no matter how many available threads were allocated? — user1131435
you need remove the waitGroup, or goroutine schedule and synchronization problem is mixed, making the testing too complex. — Jiang YD
Your way of using the wait group is weird. Here's how it's normally done. — thwd

thwd thwd · Accepted Answer · 2015-09-16T09:37:30

I revised your program to look like the following:

package main

import (
    "math/rand"
    "runtime"
    "sync"
    "time"
)

var workerWG sync.WaitGroup

func worker(fibNum chan int) {
    for tgt := range fibNum {
        var a, b float64 = 0, 1
        for i := 0; i < tgt; i++ {
            a, b = a+b, a
        }
    }
    workerWG.Done()
}

func main() {
    rand.Seed(time.Now().UnixNano())
    runtime.GOMAXPROCS(1) // LINE IN QUESTION

    var fibNum = make(chan int)

    for i := 0; i < 4; i++ {
        go worker(fibNum)
        workerWG.Add(1)
    }
    for i := 0; i < 500000; i++ {
        fibNum <- rand.Intn(100000)
    }
    close(fibNum)
    workerWG.Wait()
}

I cleaned up the wait group usage.
I changed rand.Intn(1000) to rand.Intn(100000)

On my machine that produces:

$ time go run threading.go (GOMAXPROCS=1)

real    0m20.934s
user    0m20.932s
sys 0m0.012s

$ time go run threading.go (GOMAXPROCS=8)

real    0m10.634s
user    0m44.184s
sys 0m1.928s

This means that in your original code, the work performed vs synchronization (channel read/write) was negligible. The slowdown came from having to synchronize across threads instead of one and only perform a very small amount of work inbetween.

In essence, synchronization is expensive compared to calculating fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.

Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:

$ time go run threading.go 

real    0m4.971s
user    0m35.692s
sys 0m0.044s

Why does this program run faster when it's allocated fewer threads?

3 Answers