
I’ve run my code under a variety of configurations, and the results show what I believe to be odd behavior. My testing was on a dual-core Intel Xeon processor with Hyper-Threading.

No OpenMP '#pragma' statement, total runtime = 507 seconds

With OpenMP '#pragma' statement specifying 1 thread, total runtime = 117 seconds

With OpenMP '#pragma' statement specifying 2 threads, total runtime = 150 seconds

With OpenMP '#pragma' statement specifying 3 threads, total runtime = 157 seconds

With OpenMP '#pragma' statement specifying 4 threads, total runtime = 144 seconds

I can’t figure out why commenting out my OpenMP line slows the program down so much: 1 thread without OpenMP takes far longer than 1 thread WITH OpenMP.

All I am changing is between:

//#pragma omp parallel for shared(segs) private(i, j, p_hough) num_threads(1) schedule(guided)

and...

#pragma omp parallel for shared(segs) private(i, j, p_hough) num_threads(N) schedule(guided)

(with N set to 1, 2, 3, or 4 in turn)

Anyways, if anyone has any idea why this may be happening, please let me know!

Thanks for any help,

Brett

EDIT: I'll address some of the comments here

To be clear, I am using num_threads(1), num_threads(2), etc., one value per run.

With further investigation, it turns out that my results are inconsistent based upon whether or not the "schedule(guided)" line is included in the code.

- When I use the schedule(guided) clause, I get the fastest run, regardless of the number of threads.
- When I use the default scheduler, my runs are significantly slower and produce different values.
- With schedule(guided), adding threads gives no improvement.
- Without schedule(guided), adding threads does give an improvement.

I guess I haven't found a good enough description of what schedule(guided) does. My understanding is that it tries to split up the loop so that the most time-intensive iterations happen first, which should minimize the time any one thread spends waiting for the others to complete their iterations.
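For reference, the OpenMP specification describes schedule(guided) as handing out chunks whose size is proportional to the number of unassigned iterations divided by the number of threads, shrinking down to a minimum chunk size (1 by default). It does not look at how expensive each iteration is. The sketch below simulates one such policy in plain C so the chunk sizes can be inspected. The function names `guided_chunks` and `print_guided_chunks` are hypothetical, and the exact formula (ceiling of remaining / threads) is an assumption; real runtimes are free to use a different proportion.

```c
#include <assert.h>
#include <stdio.h>

/* Simulate an ASSUMED guided policy: each grab takes
 * ceil(remaining / nthreads) iterations, but never fewer than min_chunk
 * (the last chunk simply takes whatever is left).
 * Chunk sizes are written to sizes[]; at most max_chunks are recorded.
 * Returns the number of chunks recorded. */
int guided_chunks(int total, int nthreads, int min_chunk,
                  int *sizes, int max_chunks)
{
    int remaining = total;
    int count = 0;

    assert(nthreads > 0 && min_chunk > 0);

    while (remaining > 0 && count < max_chunks) {
        int chunk = (remaining + nthreads - 1) / nthreads; /* ceiling division */
        if (chunk < min_chunk) chunk = min_chunk;
        if (chunk > remaining) chunk = remaining;
        sizes[count++] = chunk;
        remaining -= chunk;
    }
    return count;
}

/* Print the simulated chunk sizes for a loop of `total` iterations. */
void print_guided_chunks(int total, int nthreads)
{
    int sizes[1024];
    int n = guided_chunks(total, nthreads, 1, sizes, 1024);
    int k;

    printf("%d iterations, %d threads:", total, nthreads);
    for (k = 0; k < n; k++) {
        printf(" %d", sizes[k]);
    }
    printf("\n");
}
```

Calling `print_guided_chunks(900, 4)` lists the simulated chunks for a 900-iteration loop on 4 threads; under this model the first chunk is 225 iterations and the sizes shrink from there, while with 1 thread the whole loop is a single chunk.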

It appears that for my ~900-iteration loop, when I use schedule(guided), I'm only processing ~200 iterations, whereas without schedule(guided) I'm processing all 900 iterations. Any thoughts?
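One way to check the iteration count independently of the real workload is a stripped-down loop that only records which iterations ran. This is a minimal sketch, not the original code: the loop body is a stand-in for the Hough processing (which is not shown here), and the names `count_iterations_guided` and `count_iterations_default` are hypothetical. A conforming OpenMP runtime should execute every iteration exactly once under either schedule, so both functions should return n.

```c
#include <assert.h>
#include <stdio.h>

/* Run an n-iteration loop under schedule(guided), marking each iteration
 * in seen[] (caller provides at least n bytes). Returns how many
 * iterations were executed, summed across threads by the reduction. */
int count_iterations_guided(int n, int nthreads, unsigned char *seen)
{
    int i;
    int executed = 0;

    assert(n >= 0 && nthreads > 0);
    for (i = 0; i < n; i++) seen[i] = 0;

    /* i is the loop variable of the parallel for, so it is private. */
    #pragma omp parallel for num_threads(nthreads) schedule(guided) reduction(+:executed)
    for (i = 0; i < n; i++) {
        seen[i] = 1;      /* each iteration writes only its own slot */
        executed += 1;    /* per-thread partial count, combined at the end */
    }
    return executed;
}

/* Same loop with no schedule clause, i.e. the implementation's default. */
int count_iterations_default(int n, int nthreads, unsigned char *seen)
{
    int i;
    int executed = 0;

    assert(n >= 0 && nthreads > 0);
    for (i = 0; i < n; i++) seen[i] = 0;

    #pragma omp parallel for num_threads(nthreads) reduction(+:executed)
    for (i = 0; i < n; i++) {
        seen[i] = 1;
        executed += 1;
    }
    return executed;
}
```

If a counter like this reports 900 in the stripped-down loop but the real loop appears to skip work, the difference is more likely to come from the loop body (for example, how shared and private variables are used) than from the schedule clause itself.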

Is the program still producing the correct results? Perhaps you found a bug in your compiler's OpenMP implementation. - Gregor Brandt
Try removing schedule(guided). - Jacob
Are you sure you used the same compiler flags (especially optimization flags) in every case? - KeithB
gbrandt: Thank you for the comments. I'm checking the output right now to verify that I'm getting what I expect from the program, and I'll follow up when finished. - Brett
Jacob: When I remove the schedule(guided), I get the same result as not having the #pragma... I'm still confused as to how this is possible... - Brett

1 Answer


OpenMP has significant synchronization overheads. I have found that unless you have a really big loop that does a lot of work, and has no intra-loop synchronization, then it is generally not worthwhile using OpenMP.

I think that when you set the number of threads to one (1), OpenMP simply does a procedure call to the OpenMP procedure implementing the loop, so the overhead is minimal, and performance is essentially identical to the non-OpenMP case.

Otherwise, I think OpenMP sets some semaphores, and waiting "worker" threads wake up, synchronize their access to the data structures telling them what loop parameters to set, and then call the routine that does the work, and when they complete the chunk of work, they signal the master thread again. This synchronization must happen for each chunk of work that a thread does, and the synchronization costs are non-trivial.

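Using static scheduling, i.e. schedule(static), can help reduce the scheduling/synchronization overheads, particularly if the number of loop iterations is large relative to the number of cores. With a static schedule, the iterations are divided among the threads up front, so threads do not have to come back and request more work during the loop.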
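As a rough illustration of the static option, here is a minimal sketch of a loop that uses schedule(static) with a reduction. The workload (a sum of squares) and the name `sum_squares_static` are made up for the example and are not taken from the question's code. If the file is compiled without OpenMP enabled, the pragma is ignored and the loop just runs serially.

```c
#include <assert.h>
#include <stdio.h>

/* Sum i*i for i in [0, n) using a statically scheduled parallel loop.
 * With schedule(static) and no chunk size, iterations are split into
 * roughly equal contiguous blocks, one per thread, decided up front. */
long long sum_squares_static(int n, int nthreads)
{
    long long total = 0;
    int i;

    assert(n >= 0 && nthreads > 0);

    #pragma omp parallel for num_threads(nthreads) schedule(static) reduction(+:total)
    for (i = 0; i < n; i++) {
        total += (long long)i * i;  /* per-thread partial sum */
    }
    return total;
}
```

A call such as `sum_squares_static(900, 4)` should return the same value as `sum_squares_static(900, 1)`, which is a cheap sanity check that the parallel version covers the same iterations as the serial one.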