Why is OpenMP outperforming threads?

Question

I've been calling this with OpenMP:

#pragma omp parallel for num_threads(totalThreads)
for(unsigned i=0; i<totalThreads; i++)
{
    workOnTheseEdges(startIndex[i], endIndex[i]);
}

And this with C++11 std::thread (I believe those are just pthreads):

vector<thread> threads;
for(unsigned i=0; i<totalThreads; i++)
{
    // construct each thread in place with the function and its arguments
    threads.emplace_back(workOnTheseEdges, startIndex[i], endIndex[i]);
}
for (auto& t : threads)
{
    t.join();
}

But the OpenMP implementation is about twice as fast! I would have expected C++11 threads to be faster, since they are lower-level. Note: the code above is called not just once but probably 10,000 times in a loop, so maybe that has something to do with it?

Edit: to clarify, in practice I use either the OpenMP version or the C++11 version, not both. With the OpenMP code it takes 45 seconds; with the C++11 code it takes 100 seconds.
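One way to test the hunch that the repeated calls matter is to time thread creation on its own. The sketch below uses a hypothetical helper, spawnJoinOverhead, that on every iteration creates and joins threads whose bodies do nothing, so the measured time is roughly the pure create/join cost the std::thread version pays per call. The 16 threads and 10,000 calls suggested for a run are assumptions taken from the comments and the note above, not measured facts.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical helper: on each of `calls` iterations, spawn `threadsPerCall`
// threads with empty bodies and join them, mimicking the per-call pattern of
// the std::thread version without any real work. Returns total wall time.
std::chrono::duration<double> spawnJoinOverhead(unsigned threadsPerCall, unsigned calls) {
    const auto start = std::chrono::steady_clock::now();
    for (unsigned c = 0; c < calls; ++c) {
        std::vector<std::thread> threads;
        threads.reserve(threadsPerCall);
        for (unsigned i = 0; i < threadsPerCall; ++i)
            threads.emplace_back([] {});  // no work: only creation and join are timed
        for (auto& t : threads) t.join();
    }
    // steady_clock::duration converts implicitly to a floating-point seconds duration
    return std::chrono::steady_clock::now() - start;
}
```

Calling spawnJoinOverhead(16, 10000).count() gives the seconds spent only on creating and joining 160,000 threads on a given machine; comparing that figure with the 55-second gap between the two versions indicates how much of the difference thread creation could account for.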

My crystal ball fails to reveal what the value of totalThreads is, how many cores/HW threads your CPU has, what the size of startIndex is and how much time it takes to execute workOnTheseEdges() once. — Hristo Iliev
They aren't doing the same thing. The OpenMP version is distributing 10,000 tasks over 16 threads. The C++11 version is running 10,000 tasks on 10,000 threads. Threads are expensive, and having more threads than cores is even more expensive. You can't just throw new threads at every small task (unless you happen to have 10,000 or so cores to run them on). The OpenMP version is taking care of this for you. — adpalumbo
@user2588666: You said "the above code is called in a loop". Each and every time it's called, the std::thread version creates totalThreads new threads, but the OpenMP version reuses the same 16 each time the loop executes. — Mooing Duck
@user2588666: Visual Studio implements std::async to reuse the same threads. Other than that, you'd have to manage the 16 threads yourself, which is pretty easy in your case: make the threads vector static (coliru.stacked-crooked.com/a/3fdad471c0c26d41). — Mooing Duck
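The suggestion to manage the threads yourself amounts to creating the workers once and reusing them on every call, which is roughly what many OpenMP runtimes do with their thread team. The following is a minimal sketch of that idea under stated assumptions; it is not the code behind the coliru link, and the class ReusableTeam and its parallelFor method are hypothetical names. Workers sleep on a condition variable between calls, each call starts a new "generation" of work, and the caller blocks until every worker has reported back.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical fork-join team: worker threads are created once in the
// constructor and reused by every parallelFor call (no per-call spawning).
class ReusableTeam {
public:
    explicit ReusableTeam(unsigned workers) : workerCount_(workers == 0 ? 1 : workers) {
        for (unsigned w = 0; w < workerCount_; ++w)
            threads_.emplace_back([this, w] { workerLoop(w); });
    }
    ReusableTeam(const ReusableTeam&) = delete;
    ReusableTeam& operator=(const ReusableTeam&) = delete;

    ~ReusableTeam() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();
    }

    // Runs body(i) for every i in [0, n); worker w handles i = w, w+W, w+2W, ...
    // Blocks until every worker has finished this round.
    void parallelFor(unsigned n, const std::function<void(unsigned)>& body) {
        std::unique_lock<std::mutex> lock(m_);
        body_ = &body;
        n_ = n;
        pending_ = workerCount_;
        ++generation_;  // signals a new round of work to all workers
        wake_.notify_all();
        done_.wait(lock, [this] { return pending_ == 0; });
        body_ = nullptr;
    }

private:
    void workerLoop(unsigned w) {
        unsigned long seen = 0;  // last round this worker has processed
        for (;;) {
            const std::function<void(unsigned)>* body;
            unsigned n;
            {
                std::unique_lock<std::mutex> lock(m_);
                wake_.wait(lock, [&] { return stop_ || generation_ != seen; });
                if (stop_) return;
                seen = generation_;
                body = body_;
                n = n_;
            }
            for (unsigned i = w; i < n; i += workerCount_) (*body)(i);
            std::lock_guard<std::mutex> lock(m_);
            if (--pending_ == 0) done_.notify_one();  // last worker wakes the caller
        }
    }

    const unsigned workerCount_;
    std::mutex m_;
    std::condition_variable wake_, done_;
    const std::function<void(unsigned)>* body_ = nullptr;
    unsigned n_ = 0;
    unsigned pending_ = 0;
    unsigned long generation_ = 0;
    bool stop_ = false;
    std::vector<std::thread> threads_;  // declared last; joined in the destructor body
};
```

Applied to the question's loop, this would look roughly like static ReusableTeam team(totalThreads); followed by team.parallelFor(totalThreads, [&](unsigned i) { workOnTheseEdges(startIndex[i], endIndex[i]); });, so the threads are spawned once rather than on each of the 10,000 calls. The sketch assumes the body never throws; an exception escaping a worker thread would call std::terminate.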

adpalumbo · Accepted Answer · 2014-04-23T21:51:29

Where does totalThreads come from in your OpenMP version? I bet it's not startIndex.size().

The OpenMP version queues the requests onto totalThreads worker threads. It looks like the C++11 version creates startIndex.size() threads, which involves a ridiculous amount of overhead if that's a big number.
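If startIndex.size() is indeed large, the C++11 version can approximate what the OpenMP loop does by capping the number of threads and giving each thread a contiguous block of indices. OpenMP's default schedule is implementation-defined, though a static split like this one is a common choice. A sketch, with staticParallelFor as a hypothetical helper that assumes the individual task(i) calls are independent:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: run task(i) for i in [0, n) using at most `workers`
// std::threads, each handling one contiguous block of indices (similar in
// spirit to a static OpenMP schedule). Assumes task(i) calls are independent.
void staticParallelFor(std::size_t n, unsigned workers,
                       const std::function<void(std::size_t)>& task) {
    if (n == 0) return;
    if (workers == 0) workers = 1;
    const std::size_t w = std::min<std::size_t>(workers, n);  // never more threads than tasks
    const std::size_t base = n / w, extra = n % w;             // first `extra` blocks get one more
    std::vector<std::thread> threads;
    threads.reserve(w);
    std::size_t begin = 0;
    for (std::size_t t = 0; t < w; ++t) {
        const std::size_t end = begin + base + (t < extra ? 1 : 0);
        threads.emplace_back([&task, begin, end] {
            for (std::size_t i = begin; i < end; ++i) task(i);
        });
        begin = end;
    }
    for (auto& th : threads) th.join();  // task outlives the threads
}
```

For example, staticParallelFor(startIndex.size(), std::thread::hardware_concurrency(), [&](std::size_t i) { workOnTheseEdges(startIndex[i], endIndex[i]); }); keeps the thread count near the core count (hardware_concurrency() may return 0, which the helper treats as 1). This still spawns threads on every call, so in a 10,000-iteration loop it would ideally be combined with threads that are reused across calls.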
