I have written a computationally intensive program that runs serially from an ordinary terminal and takes about 30 s per iteration of its main loop. I decided to build it in Visual Studio with the Intel Compiler and use OpenMP to parallelise its for loops.
Firstly, the same serial code that takes 30 s per iteration with the ordinary g++ compiler on a normal terminal takes about 600 s per iteration when built in Visual Studio with the Intel Compiler.
Secondly, even after parallelising the for loops, the code takes approximately the same time per iteration. I have attached a simplified version of the code below.
I am worried about race conditions, because the parallelised loops read from and write to the same unordered_map, although no two threads ever access the same element.
I've tried both solutions mentioned in this SO thread (OpenMP/__gnu_parallel for an unordered_map): iterating over the buckets of the unordered_map, as well as using #pragma omp task. Both give me similar execution times; a rough sketch of the task variant is included after the code below.
#include <unordered_map>

std::unordered_map<int, Cluster*> clusters;   // one entry per cluster
const unsigned bc = clusters.bucket_count();  // bc: number of buckets, taken after the map is filled

while (true)   // outer iteration loop
{
    #pragma omp parallel
    {
        // Distribute the hash-table buckets across threads
        #pragma omp for schedule(dynamic, bc/4)
        for (unsigned b = 0; b < bc; ++b)
        {
            // Walk the elements that fall into bucket b
            for (auto c = clusters.begin(b); c != clusters.end(b); ++c)
            {
                // Computations involving the current cluster;
                // nothing is inserted into or erased from the map here,
                // and no two threads ever touch the same element
            }
        }
    }
}
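For reference, the task-based variant looks roughly like the sketch below. The names Cluster (with a dummy member), process_cluster and process_all are placeholders standing in for my real types and per-cluster work, not my actual code:

#include <unordered_map>

struct Cluster { double value = 0.0; };   // placeholder for the real Cluster type

// Hypothetical stand-in for the per-cluster computation.
void process_cluster(Cluster* c) { c->value += 1.0; }

void process_all(std::unordered_map<int, Cluster*>& clusters)
{
    #pragma omp parallel
    {
        #pragma omp single                    // one thread walks the map and creates tasks
        {
            for (auto it = clusters.begin(); it != clusters.end(); ++it)
            {
                Cluster* c = it->second;      // each task works on its own element
                #pragma omp task firstprivate(c)
                process_cluster(c);           // no inserts/erases into the map here
            }
        }
        // all tasks have finished by the implicit barrier at the end of the parallel region
    }
}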
Overall I'm guessing the issue is with Visual Studio, but since I'm new to OpenMP it could also be with my code. My project properties are set to use the Intel Compiler and OpenMP Support is enabled.
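One thing I still need to rule out is whether the Visual Studio build is simply unoptimised (a Debug configuration defaults to /Od). As far as I understand, the command-line equivalents of my two builds would look roughly like this (the Intel line is my assumption of what the project settings map to, using the classic icl driver):

g++ -O2 -fopenmp myprog.cpp -o myprog      (terminal build with GCC)
icl /O2 /Qopenmp myprog.cpp                (Intel C++ on Windows, OpenMP enabled)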
If you write to Cluster objects in the same cache line from different threads, that will create false sharing (cache-line ping pong). But if all the shared / nearby data is read-only, you should be fine for performance. As long as you don't modify the container itself, you should also be fine for correctness; it's not a thread-safe container, but modifying e.g. 2 adjacent elements of a char[] array from different threads is always fine according to C++11 – Peter Cordes
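A minimal sketch of the padding idea the comment alludes to, using a placeholder Cluster layout rather than the real one: aligning each object to a cache line keeps members written by different threads from sharing a line.

// 64 is a typical x86 cache-line size; std::hardware_destructive_interference_size
// (from <new>, C++17) can be used instead where the toolchain supports it.
struct alignas(64) Cluster
{
    double accumulator = 0.0;   // example members written inside the parallel loop
    int    member_count = 0;
};

static_assert(alignof(Cluster) == 64, "each Cluster starts on its own cache line");

Since the map stores Cluster*, each object is separately heap-allocated, so this only matters if allocations happen to land on the same cache line; from C++17 on, operator new honours the alignas(64) request for over-aligned types.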