
It is loosely related to this question: Are std::thread pooled in C++11? Though the question differs, the intention is the same:

Question 1: Does it still make sense to use your own (or 3rd-party library) thread pools to avoid expensive thread creation?

The conclusion in the other question was that you cannot rely on std::thread to be pooled (it might or might not be). However, std::async(launch::async) seems to have a much higher chance to be pooled.

I don't think the standard requires it, but IMHO I would expect all good C++11 implementations to use thread pooling if thread creation is slow. Only on platforms where creating a new thread is inexpensive would I expect them to always spawn a new thread.

Question 2: This is just what I think; I have no facts to prove it, and I may very well be mistaken. Is this an educated guess?

Finally, I have provided some sample code that shows how I think thread creation can be expressed with async(launch::async):

Example 1:

 thread t([]{ f(); });
 // ...
 t.join();

becomes

 auto future = async(launch::async, []{ f(); });
 // ...
 future.wait();

Example 2: Fire and forget thread

 thread([]{ f(); }).detach();

becomes

 // a bit clumsy...
 auto dummy = async(launch::async, []{ f(); });

 // ... but I hope soon it can be simplified to
 async(launch::async, []{ f(); });

Question 3: Would you prefer the async versions to the thread versions?


The rest is no longer part of the question, but only for clarification:

Why must the return value be assigned to a dummy variable?

Unfortunately, the current C++11 standard forces you to capture the return value of std::async; otherwise the destructor is executed, which blocks until the action terminates. Some (e.g., Herb Sutter) consider this an error in the standard.

This example from cppreference.com illustrates it nicely:

{
  std::async(std::launch::async, []{ f(); });
  std::async(std::launch::async, []{ g(); });  // does not run until f() completes
}

Another clarification:

I know that thread pools may have other legitimate uses but in this question I am only interested in the aspect of avoiding expensive thread creation costs.

I think there are still situations where thread pools are very useful, especially if you need more control over resources. For example, a server might decide to handle only a fixed number of requests simultaneously to guarantee fast response times and to increase the predictability of memory usage. Thread pools should be fine, here.

Thread-local variables may also be an argument for your own thread pools, but I'm not sure whether it is relevant in practice:

  • A thread created with std::thread starts with freshly initialized thread-local variables, not the values from the spawning thread. Maybe this is not what you want.
  • In threads spawned by async, it is somewhat unclear to me, because the thread could have been reused. From my understanding, thread-local variables are not guaranteed to be reset, but I may be mistaken.
  • Using your own (fixed-size) thread pools, on the other hand, gives you full control if you really need it.
"However, std::async(launch::async) seems to have a much higher chance to be pooled." No, I believe its std::async(launch::async | launch::deferred) that may be pooled. With just launch::async the task is supposed to be launched on a new thread regardless of what other tasks are running. With the policy launch::async | launch::deferred then the implementation gets to choose which policy, but more importantly it gets to delay choosing which policy. That is, it can wait until a thread in a thread pool becomes available and then choose the async policy.bames53
As far as I know only VC++ uses a thread pool with std::async(). I'm still curious to see how they support non-trivial thread_local destructors in a thread pool.bames53
@bames53 I stepped through the libstdc++ that comes with gcc 4.7.2 and found that if the launch policy is not exactly launch::async then it treats it as if it were only launch::deferred and never executes it asynchronously - so in effect, that version of libstdc++ "chooses" to always use deferred unless forced otherwise.doug65536
@doug65536 My point about thread_local destructors was that destruction on thread exit isn't quite correct when using thread pools. When a task is run asynchronously it's run 'as if on a new thread', according to the spec, which means every async task gets its own thread_local objects. A thread pool based implementation has to take special care to ensure that tasks sharing the same backing thread still behave as if they have their own thread_local objects. Consider this program: pastebin.com/9nWUT40hbames53
@bames53 Using "as if on a new thread" in the spec was a huge mistake in my opinion. std::async could have been a beautiful thing for performance - it could have been the standard short-running-task execution system, naturally backed by a thread pool. Right now, it's just a std::thread with some crap tacked on to make the thread function be able to return a value. Oh, and they added redundant "deferred" functionality that overlaps the job of std::function completely.doug65536

1 Answer


Question 1:

I changed this from the original because the original was wrong. I was under the impression that Linux thread creation was very cheap, but after testing I found that the overhead of a function call in a new thread vs. a normal one is enormous. The overhead of creating a thread to handle a function call is something like 10000 or more times slower than a plain function call. So, if you're issuing a lot of small function calls, a thread pool might be a good idea.

It's quite apparent that the standard C++ library that ships with g++ doesn't have thread pools. But I can definitely see a case for them. Even with the overhead of having to shove the call through some kind of inter-thread queue, it would likely be cheaper than starting up a new thread. And the standard allows this.

IMHO, the Linux kernel people should work on making thread creation cheaper than it currently is. But the standard C++ library should also consider using a pool to implement launch::async | launch::deferred.

And the OP is correct: using ::std::thread to launch a thread of course forces the creation of a new thread instead of using one from a pool. So ::std::async(::std::launch::async, ...) is preferred.

Question 2:

Yes, basically this 'implicitly' launches a thread. But really, it's still quite obvious what's happening, so I don't think 'implicitly' is a particularly good word for it.

I'm also not convinced that forcing you to wait for a return before destruction is necessarily an error. I don't know that you should be using the async call to create 'daemon' threads that aren't expected to return. And if they are expected to return, it's not OK to be ignoring exceptions.

Question 3:

Personally, I like thread launches to be explicit. I place a lot of value on islands where you can guarantee serial access. Otherwise you end up with mutable state that you always have to be wrapping a mutex around somewhere and remembering to use it.

I liked the work queue model a whole lot better than the 'future' model because there are 'islands of serial' lying around so you can more effectively handle mutable state.

But really, it depends on exactly what you're doing.

Performance Test

So, I tested the performance of various methods of calling things and came up with these numbers on an 8 core (AMD Ryzen 7 2700X) system running Fedora 29 compiled with clang version 7.0.1 and libc++ (not libstdc++):

   Do nothing calls per second:   35365257                                      
        Empty calls per second:   35210682                                      
   New thread calls per second:      62356                                      
 Async launch calls per second:      68869                                      
Worker thread calls per second:     970415                                      

And native, on my MacBook Pro 15" (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz) with Apple LLVM version 10.0.0 (clang-1000.10.44.4) under OSX 10.13.6, I get this:

   Do nothing calls per second:   22078079
        Empty calls per second:   21847547
   New thread calls per second:      43326
 Async launch calls per second:      58684
Worker thread calls per second:    2053775

For the worker thread, I started up a thread, then used a lockless queue to send requests to another thread and then wait for a "It's done" reply to be sent back.

The "Do nothing" is just to test the overhead of the test harness.

It's clear that the overhead of launching a thread is enormous. And even the worker thread with the inter-thread queue slows things down by a factor of 20 or so on Fedora 25 in a VM, and by about 8 on native OS X.

I created a Bitbucket project holding the code I used for the performance test. It can be found here: https://bitbucket.org/omnifarious/launch_thread_performance