4
votes

I've been calling this in OpenMP:

#pragma omp parallel for num_threads(totalThreads)
for(unsigned i=0; i<totalThreads; i++)
{
    workOnTheseEdges(startIndex[i], endIndex[i]);
}

And this with C++11 std::thread (I believe those are just pthreads underneath):

vector<thread> threads;
for(unsigned i=0; i<totalThreads; i++)
{
    threads.push_back(thread(workOnTheseEdges, startIndex[i], endIndex[i]));
}
for (auto& thread : threads)
{
    thread.join();
}

But the OpenMP implementation is twice as fast! I would have expected the C++11 threads to be faster, since they are lower-level. Note: the code above is called not just once but probably about 10,000 times in a loop, so maybe that has something to do with it?

Edit: For clarification, in practice I use either the OpenMP version or the C++11 version, not both. With the OpenMP code the run takes 45 seconds; with the C++11 code it takes 100 seconds.

2
Presumably OpenMP doesn't generate thousands of threads... - Kerrek SB
My crystal ball fails to reveal what the value of totalThreads is, how many cores/HW threads your CPU has, what the size of startIndex is and how much time it takes to execute workOnTheseEdges() once. - Hristo Iliev
They aren't doing the same thing. The OpenMP version is distributing 10,000 tasks over 16 threads. The C++11 version is running 10,000 tasks on 10,000 threads. Threads are expensive, and having more threads than cores is even more expensive. You can't just throw new threads at every small task (unless you happen to have 10,000 or so cores to run them on). The OpenMP version is taking care of this for you. - adpalumbo
@user2588666: You said "the above code is called in a loop". Each and every time it's called, the std::thread version creates totalThreads new threads, but the OpenMP is reusing the same 16 each time the loop executes. - Mooing Duck
@user2588666: Visual Studio implements std::async to reuse the same threads. Other than that, you'd have to manage the 16 threads yourself (which is pretty easy in your case: make the threads vector static: coliru.stacked-crooked.com/a/3fdad471c0c26d41) - Mooing Duck

2 Answers

4
votes

Where does totalThreads come from in your OpenMP version? I bet it's not startIndex.size().

The OpenMP version queues the requests onto totalThreads worker threads. It looks like the C++11 version creates startIndex.size() threads, which involves a ridiculous amount of overhead if that's a big number.
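One way to check how much of the gap is pure thread start-up cost is to time the create/join cycle with an empty task. The following is only a rough sketch: timeSpawnJoin and doNothing are hypothetical names, not part of the question's code, and the result will vary by OS and hardware.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real work: it does nothing,
// so only thread creation and join overhead is measured.
static void doNothing() {}

// Creates and joins `threadsPerRound` threads, `rounds` times over,
// and returns the elapsed wall-clock time in seconds.
double timeSpawnJoin(int rounds, int threadsPerRound)
{
    assert(rounds >= 0 && threadsPerRound > 0);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r)
    {
        std::vector<std::thread> threads;
        threads.reserve(threadsPerRound);
        for (int i = 0; i < threadsPerRound; ++i)
        {
            threads.emplace_back(doNothing);
        }
        for (auto& t : threads) t.join();
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling timeSpawnJoin(10000, totalThreads) and comparing the result with the total run time gives an estimate of how much of the 100 seconds goes to creating and destroying threads rather than to workOnTheseEdges itself.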

3
votes

Consider the following code. The OpenMP version runs in 0 seconds while the C++11 version runs in 50 seconds. This is not due to the function being doNothing, and it's not due to the vector being declared inside the loop. As you can imagine, the C++11 threads are created and then destroyed in each iteration. OpenMP, on the other hand, actually implements thread pools. That's not in the standard, but it's in Intel's and AMD's implementations.

for(int j=1; j<100000; ++j)
{
    if(algorithmToRun == 1)
    {
        vector<thread> threads;
        for(int i=0; i<16; i++)
        {
            threads.push_back(thread(doNothing));
        }
        for(auto& thread : threads) thread.join();
    }
    else if(algorithmToRun == 2)
    {
        #pragma omp parallel for num_threads(16)
        for(unsigned i=0; i<16; i++)
        {
            doNothing();
        }
    }
}
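To get similar reuse with std::thread, the workers can be created once and woken up for each round instead of being recreated. The following is a minimal sketch of that idea, not what any OpenMP runtime actually does internally. RoundPool and runRound are hypothetical names, and the sketch assumes the task does not throw.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: a fixed set of worker threads that is reused
// for every round instead of being created and destroyed each time.
class RoundPool
{
public:
    explicit RoundPool(unsigned n)
    {
        assert(n > 0);
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { workerLoop(i); });
    }

    ~RoundPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Runs task(i) once on each worker i and blocks until all are done.
    void runRound(const std::function<void(unsigned)>& task)
    {
        std::unique_lock<std::mutex> lock(m_);
        task_ = &task;
        remaining_ = workers_.size();
        ++generation_;                 // signals that a new round started
        wake_.notify_all();
        done_.wait(lock, [this] { return remaining_ == 0; });
    }

private:
    void workerLoop(unsigned index)
    {
        unsigned long seen = 0;        // last round this worker handled
        std::unique_lock<std::mutex> lock(m_);
        for (;;)
        {
            wake_.wait(lock, [&] { return stop_ || generation_ != seen; });
            if (stop_) return;
            seen = generation_;
            const std::function<void(unsigned)>* task = task_;
            lock.unlock();             // run the task without holding the lock
            (*task)(index);
            lock.lock();
            if (--remaining_ == 0) done_.notify_one();
        }
    }

    std::mutex m_;
    std::condition_variable wake_, done_;
    const std::function<void(unsigned)>* task_ = nullptr;
    std::size_t remaining_ = 0;
    unsigned long generation_ = 0;
    bool stop_ = false;
    std::vector<std::thread> workers_;
};
```

With a RoundPool pool(16) constructed once outside the outer loop, each iteration would become something like pool.runRound([&](unsigned i){ workOnTheseEdges(startIndex[i], endIndex[i]); }); so the 16 threads are started once rather than 100,000 times.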