Parallel Program: How to find the bottleneck (CPU bound threads)

Question

I have written a parallel program using OpenMP. It uses two threads because my laptop is dual-core, and the threads do a lot of matrix operations, so they are CPU bound. There is no data sharing among the threads. A single instance of the program runs quite fast, but when I run multiple instances of the same program simultaneously, the performance degrades. Here is a plot of running time vs. number of parallel instances:

The running time for a single instance (two threads) is 0.78 seconds. The running time for two instances (four threads in total) is 2.06 seconds, which is more than double 0.78. After that, the running time increases in proportion to the number of instances (and hence the number of threads).

Here is the timing profile of one of the instances when multiple were run in parallel:

[Image: profile]

Can someone offer insights into what could be going on? The profile shows that 50% of the time is being consumed by OpenMP. What does that mean?

What you observe is context switching between two different processes, each running 2 threads across both CPUs. This scaling has nothing to do with a bottleneck in your application. — Bort
You can't state in general that matrix operations are CPU bound. Few actually are. Matrix multiplication, if done right, is CPU bound (as are LU and Cholesky factorization), but many others are not. — Z boson

soandos · Accepted Answer · 2014-05-20T20:42:04

Similar to what @Bort said, you made the application multithreaded (two threads) because you have two cores.

This means that when only one instance of your program is running, it (ideally) gets the whole CPU to itself.

However, if two instances of the application are running at the same time, there are no spare cores: four CPU-bound threads have to share two cores, so each instance will take at least twice as long. The same holds for more instances.
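
To make the arithmetic concrete, here is a rough sketch of the best-case scaling, using the timings from the question. The function name `ideal_runtime` and the assumption of perfectly fair time-slicing are mine, not something measured; real runs add context-switching and OpenMP runtime overhead on top.

```python
def ideal_runtime(t_single, instances, threads_per_instance=2, cores=2):
    """Best-case wall time per instance when CPU-bound threads share cores fairly.

    Assumes perfect time-slicing and no scheduling or synchronization overhead.
    """
    total_threads = instances * threads_per_instance
    # Once there are more runnable threads than cores, each thread only gets
    # a cores/total_threads share of a core.
    oversubscription = max(1.0, total_threads / cores)
    return t_single * oversubscription


t1 = 0.78            # single instance, from the question (seconds)
measured_two = 2.06  # two instances, from the question (seconds)

ideal_two = ideal_runtime(t1, 2)
overhead = measured_two - ideal_two
print(f"ideal: {ideal_two:.2f} s, measured: {measured_two:.2f} s, "
      f"overhead: {overhead:.2f} s")
```

Under these assumptions, two instances cannot finish in less than about 1.56 s each. The gap between that and the measured 2.06 s is the part that cannot be explained by core sharing alone.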

You cannot fix this issue without also increasing the number of cores available to each instance (i.e. keeping it at 2 cores per instance, rather than a shrinking share of the same 2 cores).
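
If you cannot add cores, the other direction is to give each instance fewer threads so the total never exceeds the core count. Below is a minimal sketch of that idea. The helper names `threads_for` and `openmp_env` are hypothetical, and it only has an effect if the program takes its thread count from the standard `OMP_NUM_THREADS` environment variable rather than hard-coding two threads.

```python
import os


def threads_for(instances, cores=None):
    """Threads each instance may use so that total threads do not exceed cores."""
    if cores is None:
        cores = os.cpu_count() or 1
    # Never go below one thread, even with more instances than cores.
    return max(1, cores // instances)


def openmp_env(instances, cores=None):
    """Copy of the current environment with OMP_NUM_THREADS capped accordingly."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads_for(instances, cores))
    return env


# On the dual-core laptop from the question, two instances get one thread each.
print(openmp_env(instances=2, cores=2)["OMP_NUM_THREADS"])
```

You would pass the result as the `env` argument of `subprocess.run` or `subprocess.Popen` when launching each instance. This does not make the total work finish sooner, since two cores are still two cores. It only avoids having more runnable threads than cores.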
