C++17 upgraded 69 STL algorithms to support parallelism via an optional ExecutionPolicy parameter (passed as the first argument), e.g.
std::sort(std::execution::par, begin(v), end(v));
I suspect the C++17 standard deliberately says nothing about how to implement the multi-threaded algorithms, leaving it up to the library writers to decide what is best (and allowing them to change their minds, later). Still, I'm keen to understand at a high level what issues are being considered in the implementation of the parallel STL algorithms.
Some questions on my mind include (but are not limited to!):
- How is the maximum number of threads used (by the C++ application) related to the number of CPU and/or GPU cores on the machine?
- What differences are there in the number of threads each algorithm uses? (Will each algorithm always use the same number of threads in every circumstance?)
- Is there any consideration given to other parallel STL calls on other threads (within the same app)? (e.g. if a thread invokes std::for_each(par, ...), will it use more/fewer/the same number of threads depending on whether a std::sort(par, ...) is already running on some other thread(s)? Is there a thread pool, perhaps?)
- Is any consideration given to how busy the cores are due to external factors? (e.g. if one core is very busy, say analysing SETI signals, will the C++ application reduce the number of threads it uses?)
- Do some algorithms use only CPU cores, or only GPU cores?
- I suspect implementations will vary from library to library (and compiler to compiler?); even details about this would be interesting.
I realise the point of these parallel algorithms is to shield the programmer from having to worry about these details. However, any info that gives me a high-level mental picture of what's going on inside the library calls would be appreciated.