I am trying to assemble a big vector using multiple threads. Each thread reads through its own per-thread vector and writes into a specific, contiguous index range of the big vector.
The total number of entries is a fixed number N, and each thread writes N/numberOfThreads entries into the big vector. I ran the following experiment:
//each vector contains the data that a particular thread needs to process
//and has the same length = N/numberOfThreads
vector<vector<double> > threadVectors;
//the big vector that each thread needs to write into
vector<double> totalVector(N);
//initialize threadVectors ...
#pragma omp parallel
{
    int threadId = omp_get_thread_num();
    vector<double>& threadVector = threadVectors[threadId];
    int globalStartId = threadId * threadVector.size();
    std::copy(threadVector.begin(), threadVector.end(),
              totalVector.begin() + globalStartId);
}
I time the parallel part over 10 repeats with N = 1e7. After experimenting with 1-24 threads, I get the following speedups:
number of threads : time (s), speedup w.r.t. single thread
1 : 0.1797 speedup 0.99
2 : 0.1362 speedup 1.31
3 : 0.1430 speedup 1.25
4 : 0.1249 speedup 1.43
5 : 0.1314 speedup 1.36
6 : 0.1446 speedup 1.23
7 : 0.1343 speedup 1.33
8 : 0.1414 speedup 1.26
9 : 0.1370 speedup 1.30
10 : 0.1387 speedup 1.28
11 : 0.1434 speedup 1.24
12 : 0.1344 speedup 1.33
13 : 0.1299 speedup 1.37
14 : 0.1303 speedup 1.37
16 : 0.1362 speedup 1.31
18 : 0.1341 speedup 1.33
20 : 0.1384 speedup 1.29
22 : 0.1319 speedup 1.35
23 : 0.1303 speedup 1.37
24 : 0.1298 speedup 1.37
The machine has 12 cores with hyperthreading (24 hardware threads). The speedup looks quite poor, even though the algorithm involves no races or locks.
Does anyone know what the problem is?