We need to design a system that runs parallel algorithms in iterations and synchronizes after certain steps, similar to a fork-join model. The sync after every few steps is needed to exchange data via shared memory before continuing with the next iterations.
These loops continue until a user-specified time has elapsed.
One loop acts as the controller and coordinates the sync points (spinlocks in our case).
The goal is also to run as many iterations as possible (no sleeping) in these code paths.
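For illustration, here is a minimal sketch of this kind of loop using threads. It is not the actual code: `SpinBarrier` and `run_rounds` are hypothetical names, the barrier is an assumed busy-waiting implementation built on `std::atomic`, and a plain `std::vector` stands in for the shared-memory exchange area. It runs a fixed number of rounds instead of a time limit so the result is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical busy-waiting barrier (no sleep); an assumption, not the original code.
class SpinBarrier {
public:
    explicit SpinBarrier(int parties) : parties_(parties) {}

    // Spins until all parties have arrived at the sync point.
    void arrive_and_wait() {
        const unsigned gen = generation_.load();
        if (waiting_.fetch_add(1) + 1 == parties_) {
            waiting_.store(0);          // reset for the next sync point
            generation_.fetch_add(1);   // releases the spinning threads
        } else {
            while (generation_.load() == gen) {
                // busy-wait
            }
        }
    }

private:
    const int parties_;
    std::atomic<int> waiting_{0};
    std::atomic<unsigned> generation_{0};
};

// Runs `rounds` fork-join iterations with `workers` worker threads; the calling
// thread acts as the controller. Returns the sum the controller collected.
long run_rounds(int workers, int rounds) {
    assert(workers > 0 && rounds >= 0);
    SpinBarrier barrier(workers + 1);
    std::vector<long> slots(workers, 0);  // stand-in for the shared exchange area
    long total = 0;

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            for (int r = 0; r < rounds; ++r) {
                slots[w] = static_cast<long>(r) * (w + 1);  // parallel step
                barrier.arrive_and_wait();  // sync: results are published
                barrier.arrive_and_wait();  // sync: controller has consumed them
            }
        });
    }
    for (int r = 0; r < rounds; ++r) {
        barrier.arrive_and_wait();            // wait for all workers
        for (long v : slots) total += v;      // exchange data
        barrier.arrive_and_wait();            // start the next iteration
    }
    for (auto& t : pool) t.join();
    return total;
}
```

With 3 workers and 20 rounds, `run_rounds(3, 20)` returns 1140 (the sum of `r * (w + 1)` over all rounds and workers).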
When we modeled this behavior with multiple processes versus multiple threads, the threads did not scale as well as the processes.
This is not a memory-intensive application. The C++ code shows a similar pattern on both Windows and Linux.
In the first design, the controller is one application that manages the spinlocks, and 3 other applications are launched, each waiting on its respective spinlock. In the second design, the same logic is deployed as multiple threads in one application.
The benchmark for our design is to maximize the number of sync points reached in a given time. As I increase the number of processes or threads, performance degrades, but the degradation with threads is worse. Even though 5 cores are 100% loaded in both cases, threads perform badly beyond 4. Our plan is to use a maximum of 6 threads. To eliminate context-switch overhead, we tried Boost fibers, but the results were not promising.
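To make the metric concrete, the thread variant of such a benchmark could look roughly like the sketch below. This is an assumption-laden illustration, not the measured code: `SpinSyncPoint` and `count_sync_points` are hypothetical names, and the parallel step is left empty. The controller sets the stop flag between the two sync points of an iteration, so every worker sees the same value after the second one and all threads exit in the same iteration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical spin-based sync point (no sleep); an assumption, not the original code.
struct SpinSyncPoint {
    explicit SpinSyncPoint(int n) : parties(n) {}
    void arrive_and_wait() {
        const unsigned gen = generation.load();
        if (waiting.fetch_add(1) + 1 == parties) {
            waiting.store(0);
            generation.fetch_add(1);  // releases the spinning threads
        } else {
            while (generation.load() == gen) { /* busy-wait */ }
        }
    }
    const int parties;
    std::atomic<int> waiting{0};
    std::atomic<unsigned> generation{0};
};

// Counts how many iterations (sync points) complete within `duration`
// when `workers` threads plus one controller thread take part.
unsigned long long count_sync_points(int workers, std::chrono::milliseconds duration) {
    assert(workers > 0);
    using clock = std::chrono::steady_clock;
    SpinSyncPoint sync(workers + 1);
    std::atomic<bool> stop{false};
    unsigned long long count = 0;

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            while (true) {
                // the parallel step of one iteration would run here
                sync.arrive_and_wait();  // first sync: step finished
                sync.arrive_and_wait();  // second sync: controller decided
                if (stop.load()) break;
            }
        });
    }

    const auto deadline = clock::now() + duration;
    while (true) {
        sync.arrive_and_wait();
        ++count;  // one sync point reached
        if (clock::now() >= deadline) stop.store(true);
        sync.arrive_and_wait();
        if (stop.load()) break;
    }
    for (auto& t : pool) t.join();
    return count;
}
```

Calling `count_sync_points(n, std::chrono::milliseconds(1000))` for `n` from 1 to 6 and comparing the returned counts would reproduce the kind of scaling measurement described above.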
Why do threads not perform on par with multiple processes?
We ran the tests on an Intel i7 desktop with the same configuration for Windows and Linux.