I am trying to assemble a big vector using multiple threads. Each thread reads through its own per-thread vector and writes into a specific, contiguous index range of the big vector.
The total number of entries is a fixed number N, and each thread writes N/numberOfThreads entries into the big vector. I ran the following experiment:
//each vector contains the data that a particular thread needs to process
//and has the same length = N/numberOfThreads
vector<vector<double> > threadVectors;
//the big vector that each thread needs to write into
vector<double> totalVector(N);
//initialize threadVectors ...
#pragma omp parallel
{
    int threadId = omp_get_thread_num();
    vector<double>& threadVector = threadVectors[threadId];
    int globalStartId = threadId * threadVector.size();
    std::copy(threadVector.begin(), threadVector.end(),
              totalVector.begin() + globalStartId);
}
I time the parallel part over 10 repeats with N = 1e7. After experimenting with 1-24 threads, I get the following speedups:
number of threads : time (s), speedup w.r.t. single thread
1 : 0.1797 speedup 0.99
2 : 0.1362 speedup 1.31
3 : 0.1430 speedup 1.25
4 : 0.1249 speedup 1.43
5 : 0.1314 speedup 1.36
6 : 0.1446 speedup 1.23
7 : 0.1343 speedup 1.33
8 : 0.1414 speedup 1.26
9 : 0.1370 speedup 1.30
10 : 0.1387 speedup 1.28
11 : 0.1434 speedup 1.24
12 : 0.1344 speedup 1.33
13 : 0.1299 speedup 1.37
14 : 0.1303 speedup 1.37
16 : 0.1362 speedup 1.31
18 : 0.1341 speedup 1.33
20 : 0.1384 speedup 1.29
22 : 0.1319 speedup 1.35
23 : 0.1303 speedup 1.37
24 : 0.1298 speedup 1.37
The machine has 12 cores with hyperthreading (24 hardware threads). The speedup looks quite poor, even though the algorithm involves no races or locks.
Does anyone know what the problem is?