8
votes

Let's assume we have a fixed amount of calculation work, with no blocking, sleeping, or I/O waiting. The work can be parallelized very well - it consists of 100M small and independent calculation tasks.

What is faster on a 4-core CPU - running 4 threads or, let's say, 50? Why should the second variant be slower, and by how much?

As I understand it: when you run 4 heavy threads on a 4-core CPU with no other CPU-consuming processes/threads, the scheduler is allowed not to move the threads between cores at all; it has no reason to do so in this situation. Core 0 (the main CPU) will be responsible for executing the interrupt handler for the hardware timer 250 times per second (a typical Linux configuration) and for other hardware interrupt handlers, but the other cores need not be disturbed at all.
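To make that assumption concrete, here is a minimal sketch of pinning each worker to its own core explicitly (Linux-only, pthreads affinity API; compile with `g++ -O2 -pthread`). In practice the scheduler tends to keep CPU-bound threads in place even without this:

```cpp
// Sketch: explicitly pin each worker thread to one core on Linux.
#include <pthread.h>   // pthread_setaffinity_np
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int ncores = 4;
    std::vector<std::thread> workers;
    for (int core = 0; core < ncores; ++core) {
        workers.emplace_back([core] {
            std::printf("worker pinned to core %d\n", core);
            // ... do 1/4 of the 100M tasks here ...
        });
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        // Restrict this thread to a single core so the scheduler
        // never migrates it.
        pthread_setaffinity_np(workers.back().native_handle(),
                               sizeof(set), &set);
    }
    for (auto &w : workers) w.join();
    return 0;
}
```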

What is the cost of context switching? The time to store and restore CPU registers for a different context? What about the caches, pipelines, and the various branch-prediction facilities inside the CPU? Can we say that each time we switch contexts, we hurt the caches, pipelines, and some of the instruction-decoding facilities in the CPU? So the more threads that execute on a single core, the less work they can do together compared to their serial execution?

The question of caches and other hardware optimizations in a multithreaded environment is the one that interests me most right now.
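For concreteness, a minimal harness for the measurement I have in mind might look like this (the square-root call is just a stand-in for one small task, and the chunked grabbing keeps the shared counter from becoming the bottleneck; compile with `g++ -O2 -pthread`):

```cpp
// Hypothetical harness: time how long N threads take to chew through
// a fixed pool of tiny independent tasks.
#include <atomic>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static const long kTasks = 100000000L;   // the 100M small tasks
static const long kChunk = 4096;         // tasks grabbed per counter update

// Stand-in for one small, independent calculation (an assumption).
static double tiny_task(long i) { return std::sqrt((double)i); }

static void run(int nthreads) {
    std::atomic<long> next(0);
    std::mutex m;
    double checksum = 0.0;   // keeps the work from being optimized out
    auto t0 = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&] {
            double local = 0.0;
            for (long i; (i = next.fetch_add(kChunk)) < kTasks; )
                for (long j = i; j < i + kChunk && j < kTasks; ++j)
                    local += tiny_task(j);
            std::lock_guard<std::mutex> lock(m);
            checksum += local;
        });
    for (auto &t : pool) t.join();

    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("%2d threads: %6.2fs (checksum %.3g)\n",
                nthreads, dt.count(), checksum);
}

int main() {
    for (int n : {1, 2, 4, 8, 16, 50}) run(n);
}
```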

5
The answer is heavily application-, system-, and machine-specific. But it is probably more than 4, yet much less than 50 threads. Did you try to measure with 4, 6, 8, 10 threads? – Basile Starynkevitch
4 threads (or 8 w/ Hyper-Threading). Fewer data portions, better cache properties. – bestsss

5 Answers

12
votes

As @Basile mentions in the comments, this is highly application-, system-, and environment-specific.

And as such, I'm not going to take the hard-line approach of mandating exactly 1 thread for each core (or 2 threads/core in the case of Hyperthreading).

As an experienced shared-memory programmer, I have seen from my experience that the optimal # of threads (for a 4-core machine) can range anywhere from 1 to 64+.

Now I will enumerate the situations that can cause this range:

Optimal Threads < # of Cores

In certain tasks that are very fine-grained parallel (such as small FFTs), the overhead of threading is the dominant performance factor. In some cases, it's not helpful to parallelize at all. In some cases, you get a speedup with 2 threads, but backwards scaling at 4 threads.

Another issue is resource contention. Even if you have a highly parallelizable task that can easily split across 4 cores/threads, you may be bottlenecked by memory bandwidth and cache effects. So often, you find that 2 threads will be just as fast as 4 threads. (as is often the case with very large FFTs)
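To illustrate, here's a rough sketch of a bandwidth-bound kernel (the array sizes and the triad-style update are invented for illustration). On many 4-core desktops, the 2-thread and 4-thread timings come out nearly identical because DRAM bandwidth saturates first:

```cpp
// Sketch: a bandwidth-bound kernel. Each thread streams through its own
// slice of two large arrays; trivial compute per element, heavy memory
// traffic. Compile with g++ -O2 -pthread.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const size_t n = 1u << 25;   // two ~256 MiB arrays: far beyond any cache
    std::vector<double> a(n, 1.0), b(n, 2.0);
    for (int nthreads : {1, 2, 4}) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                size_t lo = n / nthreads * t;
                size_t hi = (t == nthreads - 1) ? n : n / nthreads * (t + 1);
                for (size_t i = lo; i < hi; ++i)
                    a[i] += 3.0 * b[i];
            });
        for (auto &th : pool) th.join();
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        std::printf("%d thread(s): %.3fs\n", nthreads, dt.count());
    }
}
```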

Optimal Threads = # of Cores

This is the optimal case. No need to explain here - one thread per core. Most embarrassingly parallel applications that are not memory or I/O bound fit right here.

Optimal Threads > # of Cores

This is where it gets interesting... very interesting. Have you heard about load-imbalance? How about over-decomposition and work-stealing?

Many parallelizable applications are irregular - meaning that the tasks do not split into sub-tasks of equal size. So you may end up splitting a large task into 4 unequal pieces, assigning them to 4 threads, and running them on 4 cores... the result? Poor parallel performance, because 1 thread happened to get 10x more work than the other threads.

A common solution here is to over-decompose the task into many sub-tasks. You can either create a thread for each one of them (so now you get threads >> cores), or you can use some sort of task scheduler with a fixed number of threads. Not all tasks are suited to both approaches, so quite often, over-decomposing a task into 8 or 16 threads for a 4-core machine gives optimal results.
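Here's a minimal sketch of the second approach - a fixed pool of 4 threads dynamically pulling from a deliberately uneven, over-decomposed set of chunks (the chunk costs are invented for illustration):

```cpp
// Sketch: over-decomposition with a fixed pool. 64 chunks >> 4 threads;
// idle threads keep pulling chunks, so one unlucky thread can't stall
// the whole job. Compile with g++ -O2 -pthread.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for an irregular piece of work costing 'units' of compute.
static void process_chunk(long units) {
    volatile double x = 0;
    for (long k = 0; k < units; ++k) x += k * 0.5;
}

int main() {
    const int nthreads = 4;     // one per core
    const int nchunks  = 64;    // >> cores: the over-decomposition
    std::vector<long> cost(nchunks);
    for (int i = 0; i < nchunks; ++i)
        cost[i] = (i % 7 == 0) ? 10000000L : 500000L;  // deliberately uneven

    std::atomic<int> next(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&] {
            // Dynamic scheduling: grab the next unclaimed chunk.
            for (int i; (i = next.fetch_add(1)) < nchunks; )
                process_chunk(cost[i]);
        });
    for (auto &t : pool) t.join();
    std::printf("all chunks done\n");
}
```

A work-stealing scheduler (as in Cilk or TBB) generalizes this idea beyond a single shared counter.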


Although spawning more threads can lead to better load-balance, the overhead builds up. So there's typically an optimal point somewhere. I've seen as high as 64 threads on 4 cores. But as mentioned, it's highly application specific. And you need to experiment.


EDIT: Expanding the answer to more directly address the question...

What is the cost of context switching? The time for store and restore CPU registers for different context?

This is very dependent on the environment - and is somewhat difficult to measure directly.
Short answer: very expensive. This might be a good read.

What about caches, pipelines and various code-prediction things inside CPU? Can we say that each time we switch context, we hurt caches, pipelines and some code-decoding facilities in CPU?

Short answer: yes. When you are context-switched out, you likely flush your pipeline and mess up all the predictors. Same with the caches. The new thread is likely to replace the cached data with its own.

There's a catch though. In some applications where the threads share the same data, it's possible that one thread could potentially "warm" the cache for another incoming thread or another thread on a different core sharing the same cache. (Although rare, I've seen this happen before on one of my NUMA machines - superlinear speedup: 17.6x across 16 cores!?!?!)

So more threads executing on a single core, less work they can do together in comparison to their serial execution?

Depends, depends... Hyperthreading aside, there will definitely be overhead. But I've read a paper where someone used a second thread to prefetch for the main thread... Yes, it's crazy...

1
vote

Creating 50 threads will actually hurt performance, not improve it. It just doesn't make any sense.

Ideally you should create 4 threads, not more, not less. There will be some overhead because of context switching, but that is unavoidable. The OS's, services', and other applications' threads have to be executed too. But with today's powerful, lightning-fast CPUs this is of no concern, since those OS threads will take less than 2% of the CPU's time. Almost all of them will be in a blocked state while your program is running.

You might think that, since performance is of critical importance, you should code those small critical areas in low-level assembly language. Modern programming languages allow this.

But seriously... compilers and, in the case of Java, the JVM, will optimize those portions so well that it just isn't worth it (unless you actually want to exercise something like this). Instead of your calculations finishing in 100 seconds, they'll finish in 97 or 98. The question you must ask yourself is: is it worth all those hours of coding and debugging?

You asked about the time cost of context switching. These days it is extremely low. Look at a modern dual-core CPU running Windows 7, for example. If you start an Apache web server and a MySQL database server on that machine, you will easily go over 800 threads. The machine just doesn't feel it. To see how low this cost is, read here: How to estimate the thread context switching overhead?. To spare you the searching/reading part: a context switch can be done hundreds of thousands of times per second.
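If you want to ballpark it yourself, here is a rough sketch in the spirit of the linked question: two threads ping-pong on a condition variable, so every turn forces a hand-off. Strictly speaking this measures a synchronized hand-off (wake-up plus switch), not a bare context switch, and the numbers vary wildly across machines and OSes:

```cpp
// Sketch: estimate the cost of a thread hand-off via cv ping-pong.
// Compile with g++ -O2 -pthread.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    const int iters = 100000;
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;  // whose move it is: 0 or 1

    auto player = [&](int me) {
        for (int i = 0; i < iters; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return turn == me; });
            turn = 1 - me;   // hand over and wake the other thread
            cv.notify_one();
        }
    };

    auto t0 = std::chrono::steady_clock::now();
    std::thread a(player, 0), b(player, 1);
    a.join(); b.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    // Each iteration of each thread performs one hand-off: 2*iters total.
    std::printf("~%.0f ns per hand-off\n",
                dt.count() * 1e9 / (2.0 * iters));
}
```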

1
vote

4 threads will be faster only if you can schedule the switching between your tasks better than the operating system does.

0
votes

If you can use 4 threads, use them. There's no way 50 will go faster than 4 on a 4-core machine. All you get is more overhead.

Of course, you're describing an ideal non-real-world situation, so whatever you are actually building, you'll need to measure in order to understand how the performance is affected.

0
votes

There is Hyper-Threading technology, which can handle more than one thread per core, but whether it helps depends heavily on the type of calculation you want to do. Consider using a GPU or very low-level assembly language to achieve maximum power.