I have a simple program that runs a Monte Carlo algorithm. A single iteration of the algorithm has no side effects, so I should be able to run it on multiple threads. This is the relevant part of my program, which is written in C++11:
    // Runs max_iter simulations and writes each step count through `iterator`.
    void task(unsigned int max_iter, std::vector<unsigned int> *results, std::vector<unsigned int>::iterator iterator) {
        for (unsigned int n = 0; n < max_iter; ++n) {
            nume::Album album(535);
            unsigned int steps = album.fill_up();
            *iterator = steps;
            ++iterator;
        }
    }

    void aufgabe2() {
        std::cout << "\nAufgabe 2\n";  // "Exercise 2"
        unsigned int max_iter = 10000;
        unsigned int thread_count = 4;
        std::vector<std::thread> threads(thread_count);
        std::vector<unsigned int> results(max_iter);
        std::cout << "Computing with " << thread_count << " threads" << std::endl;

        // Give each thread its own contiguous slice of the results vector.
        int i = 0;
        for (std::thread &thread: threads) {
            std::vector<unsigned int>::iterator start = results.begin() + max_iter/thread_count * i;
            thread = std::thread(task, max_iter/thread_count, &results, start);
            i++;
        }
        for (std::thread &thread: threads) {
            thread.join();
        }

        std::ofstream out;
        out.open("out-2a.csv");
        for (unsigned int count: results) {
            out << count << std::endl;
        }
        out.close();
        std::cout << "Siehe Plot" << std::endl;  // "See plot"
    }
The puzzling thing is that it gets slower the more threads I add. With 4 threads, I get this:
real 0m5.691s
user 0m3.784s
sys 0m10.844s
Whereas with a single thread:
real 0m1.145s
user 0m0.816s
sys 0m0.320s
I realize that moving data between CPU cores might add overhead, but the vector is allocated up front and is not resized while the threads run. Is there any particular reason for this to be slower on multiple cores?
My system is an i5-2550M, which has 4 logical cores (2 physical cores with Hyper-Threading), and I use g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3.
Update
Without threads (1), the load is almost entirely user time, whereas with threads (2), more time is spent in the kernel than in user code:
10K Runs:
http://wstaw.org/m/2013/05/08/stats3.png
100K Runs:
http://wstaw.org/m/2013/05/08/Auswahl_001.png
With 100K runs, I get the following:
No threads at all:
real 0m28.705s
user 0m28.468s
sys 0m0.112s
A thread for each part of the program. Those parts do not even use the same memory, so contention for the same container should be ruled out as well. But it takes way longer:
real 2m50.609s
user 2m45.664s
sys 4m35.772s
So although the three main parts take up 300 % of my CPU, they take 6 times as long.
With 1M runs, it took real 4m45 to do. I ran 1M previously, and it took at least real 20m, if not even real 30m.
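To compare runs from inside the program rather than with `time`, wall-clock and CPU time can be measured separately. This is only a minimal sketch: `Timing` and `measure` are hypothetical names, not part of my program, and it assumes Linux/glibc, where `std::clock()` reports CPU time summed over all threads of the process (on other platforms it may behave differently). A CPU time much larger than the wall time just means several cores were busy; it says nothing about whether they did useful work.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper type: elapsed wall-clock time and consumed CPU time.
struct Timing {
    double wall_seconds;
    double cpu_seconds;
};

// Runs `work()` once and measures it with both clocks.
// Assumption: std::clock() counts process CPU time across all threads
// (true on Linux/glibc as far as I know).
template <typename Work>
Timing measure(Work work) {
    std::chrono::steady_clock::time_point wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    work();

    std::clock_t cpu_end = std::clock();
    std::chrono::steady_clock::time_point wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Usage would look like `Timing t = measure(aufgabe2);`, printing both fields for the single-threaded and the multi-threaded variant.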
10000 is really small... try a bigger number. – UmNyobe
10000 ... – Andy Prowl
[...] return and see how much of those numbers was actual computation. Also try not creating threads at all (just run the task function from the current one), it should get even faster. 10K iterations are probably nothing compared to what OS has to do to launch a thread. – hamstergene
std::rand() used in name::Album does locking. Please replace it with 7 (for testing purposes) and tell if timings change. – zch
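If zch's guess is right and the simulation draws from `std::rand()`, all threads would share one hidden generator state, and a lock around it could explain the high `sys` time. One way to test that is to give every thread its own engine from `<random>`. The following is only a sketch under that assumption: `fill_album` is a hypothetical stand-in for `nume::Album::fill_up()` (assumed here to draw random stickers until all of them have been seen), and `worker`, `run_parallel`, and the `base_seed + i` seeding scheme are made up for illustration. It requires `thread_count >= 1`.

```cpp
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in for nume::Album::fill_up(): draws uniformly random
// stickers until all `size` distinct ones were seen; returns the draw count.
unsigned int fill_album(unsigned int size, std::mt19937 &rng) {
    std::uniform_int_distribution<unsigned int> pick(0, size - 1);
    std::vector<bool> seen(size, false);
    unsigned int missing = size;
    unsigned int steps = 0;
    while (missing > 0) {
        unsigned int sticker = pick(rng);
        ++steps;
        if (!seen[sticker]) {
            seen[sticker] = true;
            --missing;
        }
    }
    return steps;
}

// Each worker owns its engine, so no generator state is shared between threads.
void worker(unsigned int seed, unsigned int album_size,
            std::vector<unsigned int>::iterator first,
            std::vector<unsigned int>::iterator last) {
    std::mt19937 rng(seed);
    for (; first != last; ++first) {
        *first = fill_album(album_size, rng);
    }
}

// Splits total_iter simulations across thread_count threads (must be >= 1).
std::vector<unsigned int> run_parallel(std::size_t total_iter, unsigned int thread_count,
                                       unsigned int album_size, unsigned int base_seed) {
    std::vector<unsigned int> results(total_iter);
    std::vector<std::thread> threads;
    std::size_t chunk = total_iter / thread_count;
    for (unsigned int i = 0; i < thread_count; ++i) {
        std::vector<unsigned int>::iterator first = results.begin() + chunk * i;
        // The last thread also takes the remainder of the division.
        std::vector<unsigned int>::iterator last =
            (i + 1 == thread_count) ? results.end() : first + chunk;
        threads.emplace_back(worker, base_seed + i, album_size, first, last);
    }
    for (std::thread &t : threads) {
        t.join();
    }
    return results;
}
```

Calling `run_parallel(10000, 4, 535, 1u)` would mirror the setup in the question. If the timings scale with the thread count in this version but not in the original, that would point to the shared random number generator rather than the results vector.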