I wrote a C++ function that is part of a larger project and is called very often. To improve performance, we split it into four parts that run two at a time in parallel. The complete program takes a single input, runs a simulation, and passes a variable of length 2000 to the function in question.
This function operates on the variable (20,096 operations at most, 150,000 additions, and no multiplications). The operations are done by func1 and func2 in parallel, twice, so each call does a quarter of the operations. Both functions share the same input in memory (double Signal of size 700, read only; double A, B, C, H, each of size 5600, read and write) and the same output (double L of size 700).
No mutexes are necessary because func1 works on one half of A, B, C, H (read and write) and writes into its half of L, while func2 does the same with the other half. However, there are times when both threads read Signal at the same time. On the second call, the threads perform almost the same operations.
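The no-mutex layout described above can be sketched as follows. This is only an illustration, not the project code: HalfTask, process_half, and run_in_halves are hypothetical names, the per-element operation is a placeholder for the real arithmetic, and all arrays are assumed to have the same length for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <vector>

// Hypothetical description of one thread's work: a shared read-only input
// plus the [begin, end) index range of the state and output that it owns.
struct HalfTask {
    const double* signal;  // read by both threads, never written
    double* state;         // stands in for A, B, C, H
    double* out;           // stands in for L
    std::size_t begin;
    std::size_t end;
};

// Thread entry point: touches only indices in [begin, end).
void* process_half(void* arg) {
    HalfTask* task = static_cast<HalfTask*>(arg);
    for (std::size_t i = task->begin; i < task->end; ++i) {
        task->state[i] += task->signal[i];  // placeholder operation
        task->out[i] = task->state[i];
    }
    return NULL;
}

// Runs two threads over disjoint halves. No mutex is used because the
// written ranges never overlap and the shared input is only read.
// Returns false if a thread could not be created or joined.
bool run_in_halves(const std::vector<double>& signal,
                   std::vector<double>& state,
                   std::vector<double>& out) {
    assert(state.size() == signal.size() && out.size() == signal.size());
    const std::size_t n = signal.size();
    const std::size_t mid = n / 2;
    HalfTask tasks[2] = {
        {signal.data(), state.data(), out.data(), 0, mid},
        {signal.data(), state.data(), out.data(), mid, n}
    };
    pthread_t threads[2];
    int created = 0;
    bool ok = true;
    for (; created < 2; ++created) {
        if (pthread_create(&threads[created], NULL, process_half,
                           &tasks[created]) != 0) {
            ok = false;
            break;
        }
    }
    // Join only the threads that were actually started.
    for (int t = 0; t < created; ++t) {
        if (pthread_join(threads[t], NULL) != 0) {
            ok = false;
        }
    }
    return ok;
}
```

Calling run_in_halves(signal, state, out) with three vectors of equal length updates both halves concurrently. The two threads still read signal at the same time, which is safe as long as nothing writes to it.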
The problem is that the threaded program runs a bit slower than the serial program. When I time each function on its own, it takes a quarter of the original function's total time, which makes sense because func1 and func2 are each called twice. I use clock_t clock() for timing (on Windows this measures wall-clock time, not processor time as the standard specifies), and the results agree with other timing tools such as Windows QueryPerformanceCounter.
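Because the meaning of clock() differs between platforms, it can help to record it next to a clock that is guaranteed to measure elapsed real time. The sketch below assumes a C++11 compiler; Timing and time_call are hypothetical names, not part of the project.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical result type holding both measurements for one call.
struct Timing {
    double wall_seconds;   // elapsed real time from std::chrono::steady_clock
    double clock_seconds;  // whatever clock() measures on this platform
};

// Runs `work` once and reports how long it took according to both clocks.
template <typename Work>
Timing time_call(Work work) {
    const std::chrono::steady_clock::time_point wall_start =
        std::chrono::steady_clock::now();
    const std::clock_t clock_start = std::clock();

    work();

    const std::clock_t clock_end = std::clock();
    const std::chrono::steady_clock::time_point wall_end =
        std::chrono::steady_clock::now();

    Timing result;
    result.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    result.clock_seconds =
        static_cast<double>(clock_end - clock_start) / CLOCKS_PER_SEC;
    return result;
}
```

If clock() reports processor time on a given platform, a run with two busy threads may show clock_seconds well above wall_seconds, so comparing the two numbers indicates which kind of time is being measured.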
I timed everything and tried every suggestion I could find. I used the optimization options -O2, -O3, and -Ofast. I also created separate memory for each thread (even for the read-only arrays) and then copied the results back.
I have two theories in mind: 1) the overhead of pthreads takes as much time as the functions themselves; 2) main() is sleeping while waiting in pthread_join.
I am more convinced by theory 2, because the only place where time is lost is somewhere in pthread_join.
I wrote this sample code to reproduce the problem. Please note that the loop positions are essential to the algorithm I am implementing, so restructuring the operations to use fewer loops will not work.
Note that if you increase the size of the data (the j<10000 and j<5000 bounds) and decrease the count range correspondingly, the threaded program begins to perform better.
This runs in 1.3 seconds.
#include <math.h>
#include <pthread.h>
#include <iostream>
#include <time.h>
using namespace std;

int main(){
    int i,m,j,k;
    clock_t time_time;
    time_time=clock();
    // serial version: the full inner loop runs on the main thread
    for (int count =0 ; count<50000;++count){
        for (j=0;j<10000;j++){
            m=j;
            k=j+1;
            i=m*j;
        }
    }
    cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
}
This runs in 5 seconds on the same processor.
#include <math.h>
#include <pthread.h>
#include <iostream>
#include <time.h>
using namespace std;

void test (int i);

void *thread_func(void *arg){
    int idxThread = *((int *) arg);
    test (1);
    return NULL;
}

void test (int i){
    int j,k,m;
    int q=0,w=1,e=2,r=3,t=4;
    int a=1,s=1,d=1,f=3,g=3;
    for (j=0;j<5000;j++){
        m=j;
        k=j+1;
        i=m*j;
    }
}

int main(){
    int numThreads=2;
    clock_t time_time;
    pthread_t threads[numThreads];
    unsigned int threadIDs[numThreads];
    time_time =clock();
    // a new pair of threads is created and joined on every iteration
    for (int count =0 ; count<50000;++count){
        for (unsigned int id = 0; id < numThreads; ++id)
        {
            threadIDs[id]=id;
            pthread_create(&(threads[id]), NULL, thread_func, (void *) &(threadIDs[id]));
        }
        for (unsigned int id = 0; id < numThreads; ++id)
        {
            pthread_join(threads[id], NULL);
        }
    }
    cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
}
EDIT: The 50000 calls to the thread function are only there to illustrate the problem. In my real code there are just 2 calls each for func1 and func2, which is 4 creations and joins, and these seem to take 2 milliseconds.
OS: Windows, MinGW32, pthreads, C++. CPU: i7, RAM: 8 GB
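One way to check the first theory is to time thread creation and joining with an empty thread body, so that only thread management is measured. The sketch below assumes a C++11 compiler; noop_thread and mean_create_join_us are hypothetical names, and the numbers it produces will depend on the OS and the pthreads implementation.

```cpp
#include <chrono>
#include <cstddef>
#include <pthread.h>

// Thread body that does no work, so the timing covers only thread management.
void* noop_thread(void*) {
    return NULL;
}

// Creates and joins one empty thread `rounds` times and returns the mean
// cost of a create/join pair in microseconds. Returns -1.0 if `rounds` is
// not positive or if a pthread call fails.
double mean_create_join_us(int rounds) {
    if (rounds <= 0) {
        return -1.0;
    }
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r) {
        pthread_t thread;
        if (pthread_create(&thread, NULL, noop_thread, NULL) != 0) {
            return -1.0;
        }
        if (pthread_join(thread, NULL) != 0) {
            return -1.0;
        }
    }
    const std::chrono::steady_clock::time_point end =
        std::chrono::steady_clock::now();
    const double total_us =
        std::chrono::duration<double, std::micro>(end - start).count();
    return total_us / rounds;
}
```

Comparing mean_create_join_us(1000) with the time one call to func1 or func2 takes on its own would show whether the create/join cost is of the same order as the work being handed to each thread.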
makefile:
CC = g++ -O3 -I............ -Wformat -c
LINK = g++ -Wl,--stack,8388608 -o
LINKFLAGS = -lpthread
test(). That's pretty wasteful, since your text description sounded like you were actually trying to make each of the two threads process their half of the inputs. – melak47