5 votes

I have a simple program that runs a Monte Carlo algorithm. One iteration of the algorithm has no side effects, so I should be able to run it with multiple threads. This is the relevant part of my whole program, which is written in C++11:

void task(unsigned int max_iter, std::vector<unsigned int> *results, std::vector<unsigned int>::iterator iterator) {
    for (unsigned int n = 0; n < max_iter; ++n) {
        nume::Album album(535);
        unsigned int steps = album.fill_up();
        *iterator = steps;
        ++iterator;
    }
}

void aufgabe2() {
    std::cout << "\nAufgabe 2\n";

    unsigned int max_iter = 10000;

    unsigned int thread_count = 4;

    std::vector<std::thread> threads(thread_count);
    std::vector<unsigned int> results(max_iter);

    std::cout << "Computing with " << thread_count << " threads" << std::endl;

    int i = 0;
    for (std::thread &thread: threads) {
        std::vector<unsigned int>::iterator start = results.begin() + max_iter/thread_count * i;
        thread = std::thread(task, max_iter/thread_count, &results, start);
        i++;
    }

    for (std::thread &thread: threads) {
        thread.join();
    }

    std::ofstream out;
    out.open("out-2a.csv");
    for (unsigned int count: results) {
        out << count << std::endl;
    }
    out.close();

    std::cout << "Siehe Plot" << std::endl;
}

The puzzling thing is that it gets slower the more threads I add. With 4 threads, I get this:

real    0m5.691s
user    0m3.784s
sys     0m10.844s

Whereas with a single thread:

real    0m1.145s
user    0m0.816s
sys     0m0.320s

I realize that moving data between CPU cores might add overhead, but the vector is allocated up front and is not resized in the middle. Is there any particular reason for this to be slower on multiple cores?

My system is an i5-2550M, which has 4 logical cores (2 physical + Hyper-Threading), and I use g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3.

Update

I noticed that without threads (1) the load is mostly user time, whereas with threads (2) there is more kernel load than user load:

10K Runs:

http://wstaw.org/m/2013/05/08/stats3.png

100K Runs:

http://wstaw.org/m/2013/05/08/Auswahl_001.png

Current main.cpp

With 100K runs, I get the following:

No threads at all:

real    0m28.705s
user    0m28.468s
sys     0m0.112s

A thread for each part of the program. Those parts do not even use the same memory, so contention on the same container should be ruled out as well. But it takes way longer:

real    2m50.609s
user    2m45.664s
sys     4m35.772s

So although the three main parts take up 300 % of my CPU, they take 6 times as long.

With 1M runs, it took real 4m45s. When I ran 1M previously, it took at least real 20m, if not real 30m.

2
10000 is really small... try a bigger number. – UmNyobe
Probably context-switch overhead is dominating the time required to carry out the task here. As suggested, add a few zeros to that 10000... – Andy Prowl
Creating a thread has overhead too. Let the task do a simple return and see how much of those numbers was actual computation. Also try not creating threads at all (just run the task function from the current one); it should get even faster. 10K iterations are probably nothing compared to what the OS has to do to launch a thread. – hamstergene
I bet std::rand() used in nume::Album does locking. Please replace it with 7 (for testing purposes) and tell us if the timings change. – zch
This is a classic example of lock contention. The threads are all waiting on each other for a lock to be released. The overhead is context switching, as the threads spend most of their time waking up and going back to sleep. I agree the contended lock may be in std::rand, but the OP may have global locks elsewhere as well. – Rafael Baptista
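The rand() suspicion in these comments can be checked by giving each thread its own generator instead of replacing the call with a constant. A minimal sketch (the function name is hypothetical, not from the question's code) using a thread_local std::mt19937, so the threads share no hidden state and thus no lock:

```cpp
#include <random>

// One independent generator per thread: no shared state and no lock
// contention, unlike the single hidden state behind std::rand().
unsigned int thread_safe_rand(unsigned int bound) {
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<unsigned int> dist(0, bound - 1);
    return dist(gen);
}
```

If the timings with this version scale with the thread count, the contended lock was in the C library's rand().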

2 Answers

5 votes

I evaluated your current main.cpp on GitHub. In addition to the comments provided above:

  1. Yes, rand() is not thread-safe, so it could be worth pre-filling an array with random values before running your multi-threaded business logic (that way you reduce the number of possible locks). The same goes for memory allocation if you plan to do any heap activity (pre-allocate before multithreading, or use a custom per-thread allocator).
  2. Don't forget about other processes. If you plan to use 4 threads on 4 cores, you'll be competing with other software (at least OS routines) for CPU resources.
  3. File output is a big source of locking. You apply the "<<" operator on every loop iteration, and it costs you a lot (I remember one funny case from my past: adding log output indirectly fixed a multithreading bug, because a generic logger is lock-driven and so acts as a kind of sync primitive. Be aware!).
  4. Finally, there is no guarantee that a multithreaded app will be faster than a single-threaded one. There are a bunch of CPU-specific, environment-specific, etc. aspects.
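Point 3 can be addressed by buffering the CSV in memory and writing it in one go; below is a sketch that keeps the question's output format (the function name is mine). Note also that std::endl forces a flush on every line, while '\n' does not:

```cpp
#include <fstream>
#include <sstream>
#include <vector>

// Build the whole CSV in memory, then write it with a single "<<",
// instead of paying stream overhead (and an endl flush) per line.
void write_results(const std::vector<unsigned int> &results) {
    std::ostringstream buffer;
    for (unsigned int count : results)
        buffer << count << '\n';
    std::ofstream out("out-2a.csv");
    out << buffer.str();
}
```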
1 vote

The vector object results is shared by all the threads created, so even if your problem is embarrassingly parallel, the shared object causes contention, not to mention cache misses (I am not good enough to explain caches on modern architectures). You should probably have n result vectors for your n threads and merge them at the end. That would speed it up, I guess.
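The per-thread-vector idea might look like the following sketch, where work() is a hypothetical stand-in for one Monte Carlo iteration (the question's album.fill_up()):

```cpp
#include <thread>
#include <vector>

// Each thread fills its own local vector (no sharing, no false sharing
// on neighbouring elements); the partial results are merged after join().
std::vector<unsigned int> run_parallel(unsigned int total, unsigned int nthreads,
                                       unsigned int (*work)()) {
    std::vector<std::vector<unsigned int>> partial(nthreads);
    std::vector<std::thread> threads;
    for (unsigned int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&partial, t, total, nthreads, work] {
            partial[t].reserve(total / nthreads);
            for (unsigned int n = 0; n < total / nthreads; ++n)
                partial[t].push_back(work());  // thread writes only its own vector
        });
    }
    for (auto &th : threads)
        th.join();
    std::vector<unsigned int> merged;  // merge happens single-threaded
    for (auto &p : partial)
        merged.insert(merged.end(), p.begin(), p.end());
    return merged;
}
```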

Another tip: use std::async instead of std::thread whenever possible. It handles thread allocation and other low-level messes. I read this in Scott Meyers' Effective Modern C++ book. However, by using threads you can set thread affinity to a particular core. So if your processor supports 8 threads, you can create 8 threads and pin each one to its own core, at least on Linux.
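A std::async version could be sketched as follows; parallel_sum is a hypothetical example that sums chunks of a vector rather than running the question's Album simulation, but the task-splitting shape is the same:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Each chunk becomes an async task returning a future; we collect the
// partial sums with get(). The library handles thread creation.
unsigned long parallel_sum(const std::vector<unsigned int> &v, unsigned int parts) {
    std::vector<std::future<unsigned long>> futures;
    std::size_t chunk = v.size() / parts;
    for (unsigned int p = 0; p < parts; ++p) {
        auto first = v.begin() + p * chunk;
        auto last  = (p + 1 == parts) ? v.end() : first + chunk;  // last chunk takes the remainder
        futures.push_back(std::async(std::launch::async,
            [first, last] { return std::accumulate(first, last, 0UL); }));
    }
    unsigned long total = 0;
    for (auto &f : futures)
        total += f.get();  // blocks until that task finishes
    return total;
}
```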