How can I reduce the effect of pthread_join. Mingw32, c++

Question

I created a C++ function that is a part of a bigger project. This function is called a lot. In order to enhance the performance we decided to split that function into 4 parts, each two running in parallel. The complete program takes one input, and one input only, then it does a simulation, it passes a variable of length 2000 to the function in question.

This function operates on the variable (20,096 max operations,150,000 additions, and no multiplications). Those operations are done by func1 and func2 in parallel, twice (so every time each function does quarter of those opperations). both functions share the same input in memory (double Signal of size 700 (read only), double A, B, C, H, (all of size (double) 5600, write and read) ) and output (double L of size 700).

No mutexes are necessary because func1 works on one half of A,B,C,H (read and write), and writes into its half in L, while func2 does the same in its half. However, there are instances where both functions, or threads, are reading Signal at the same time. On the second call, the threads almost do the same operations.

The problem is that the Threaded program runs a bit slower than the serial program. When I time each func alone they run 1/4th of the total function time of the original function time, which makes sense as func1 is called twice, and func2 is called twice as well. I use clock_t clock() for timing (This measures wall clock in windows, not as specified in the standard). but that was coherent with other timing tools like windows QueryPerformanceCounter.

I timed everything, and tried everything I saw. I used the optimizing optoins -O3 O2 Ofast. I created a separate memory for each thread (even for the read only arrays, then copied results).

I have two theories in mind 1- overhead of pthreads is taking as much time as the functions are taking 2- main() is sleeping while waiting for pthread_join.

I am more convinced with theory 2 because they only place time is lost is the somewhere in the pthread_join.

I wrote this sample code to simulate the problem. please note that the loop positions are essential in the algorithm I am implementing, so moving operations to use less loops will not work.

Note that if you increment the size of data (j<10000 and j<5000) and decrease the count range correspondingly, the performance of the threaded program begins to perform better.

This runs in 1.3 seconds.

#include <math.h>
#include <pthread.h>
#include <iostream>
#include <time.h>
using namespace std;

int main(){
    int i,m,j,k;

    clock_t time_time;
    time_time=clock();

    for (int count =0 ; count<50000;++count){
        for (j=0;j<10000;j++){
            m=j;
            k=j+1;
            i=m*j;
        }
    }
    cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
}

This runs in 5 seconds on the same processor.

#include <math.h>
#include <pthread.h>
#include <iostream>
#include <time.h>

using namespace std;

void test (int i);

void *thread_func(void *arg){
    int idxThread = *((int *) arg);
    test (1);
    return NULL;
}    

void test (int i){  
    int j,k,m;
    int q=0,w=1,e=2,r=3,t=4;
    int a=1,s=1,d=1,f=3,g=3;
    for (j=0;j<5000;j++){
        m=j;
        k=j+1;
        i=m*j;
    }
}

int main(){
    int numThreads=2;

    clock_t time_time;
    pthread_t threads[numThreads];
    unsigned int threadIDs[numThreads];
    time_time =clock();

    for (int count =0 ; count<50000;++count){
        for (unsigned int id = 0; id < numThreads; ++id)
        {
            threadIDs[id]=id;
            pthread_create(&(threads[id]), NULL, thread_func, (void *) &(threadIDs[id]));
        }
        for (unsigned int id = 0; id < numThreads; ++id)
        {
            pthread_join(threads[id], NULL);
        }
    }
        cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
}

EDIT: The 50000 calls to the thread function is to illustrate the problem, in my code, they are just 2 calls for func1, and func2, twice, which is 4 creations and joins. which seems to take 2 milliseconds.

OS: windows, mingw32, pthreads C++. CPU i7, RAM:8Gb

makefile: 
CC = g++ -O3 -I............ -Wformat -c 
LINK = g++ -Wl,--stack,8388608 -o
LINKFLAGS = -lpthread

Well your test code does nothing from the compiler's point of view, since you call your test function with a constant argument, so there are chances all your computation code gets optimized away. What you measure, more or less, is the time needed to create and destroy a thread, which seems to be reasonable (around 50us). Creating threads is costly. It's not a good idea to spend your time creating and destroying them if you want fast code. — kuroi neko
If you look at the loop, you're creating and joining 50000 * numThreads threads serially, that's going to be quite expensive. — melak47
I understand that there is an overhead associated with destroying and creating threads, but it is not the problem as I think. I think its something else. 50000 is just an example to illustrate the issue. — user304584
You are starting a thread to run just one iteration of test(). That's pretty wasteful, since your text description sounded like you were actually trying to make each of the two threads process their half of the inputs. — melak47
Here's an example of what I thought you wanted to do (hope you don't mind the C++11) coliru.stacked-crooked.com/a/e4f8bfd9e2daa832 — melak47

David Schwartz David Schwartz · Accepted Answer · 2015-09-07T02:22:37

Don't create and join threads. Keep a pool of threads running and assign them tasks as needed.
Don't wait for tasks to complete unless you have no choice. Instead, have the completion of the task trigger work to be done without waiting.

How can I reduce the effect of pthread_join. Mingw32, c++

2 Answers