18
votes

Executive summary: How can one specify in his code that OpenMP should only use threads for the REAL cores, i.e. not count the hyper-threading ones?

Detailed analysis: Over the years, I've coded a SW-only, open source renderer (rasterizer/raytracer) in my free time. The GPL code and Windows binaries are available from here: https://www.thanassis.space/renderer.html It compiles and runs fine under Windows, Linux, OS/X and the BSDs.

I introduced a raytracing mode this last month - and the quality of the generated pictures sky-rocketed. Unfortunately, raytracing is orders of magnitude slower than rasterizing. To increase speed, just as I did for the rasterizers, I added OpenMP (and TBB) support to the raytracer - to easily make use of additional CPU cores. Both rasterizing and raytracing are easily amenable to threading (work per triangle - work per pixel).

At home, with my Core2Duo, the 2nd core helped all the modes - both the rasterizing and the raytracing modes got a speedup that is between 1.85x and 1.9x.

The problem: Naturally, I was curious to see the top CPU performance (I also "play" with GPUs, with a preliminary CUDA port), so I wanted a solid base for comparisons. I gave the code to a good friend of mine, who has access to a "beast" machine, with a 16-core, $1500 Intel super processor.

He runs it in the "heaviest" mode, the raytracer mode...

...and he gets one fifth the speed of my Core2Duo (!)

Gasp - horror. What just happened?

We started trying different modifications, patches, ... and eventually we figured it out.

By using the OMP_NUM_THREADS environment variable, one can control how many OpenMP threads are spawned. As the number of threads was increasing from 1 up to 8, the speed was increasing (close to a linear increase). The moment we crossed 8, speed started to diminish, until it nose-dived to one fifth the speed of my Core2Duo, when all 16 cores were used!

Why 8?

Because 8 was the number of the real cores. The other 8 were... hyperthreading ones!

The theory: Now, this was news to me - I've seen hyper-threading help a lot (up to 25%) in other algorithms, so this was unexpected. Apparently, even though each hyper-threading core comes with its own registers (and SSE unit?), the raytracer could not make use of the extra processing power. Which led me to think...

It is probably not processing power that is starved - it is memory bandwidth.

The raytracer uses a bounding volume hierarchy data structure to accelerate ray-triangle intersections. If the hyperthreaded cores are used, then each of the "logical cores" in a pair is trying to read from different places in that data structure (i.e. in memory) - and the CPU caches (local per pair) are completely thrashed. At least, that's my theory - any suggestions are most welcome.

So, the question: OpenMP detects the number of "cores" and spawns threads to match it - that is, it includes the hyperthreaded "cores" in the calculation. In my case, this apparently leads to disastrous results, speed-wise. Does anyone know how to use the OpenMP API (if possible, portably) to only spawn threads for the REAL cores, and not the hyperthreaded ones?

P.S. The code is open (GPL) and available at the link above, feel free to reproduce on your own machine - I am guessing this will happen in all hyperthreaded CPUs.

P.P.S. Excuse the length of the post, I thought it was an educational experience and wanted to share.

3
This post has some helpful answers: "stackoverflow.com/questions/150355/…" - Dan
Unfortunately, these don't help much - they all report a number that includes the hyperthreaded "cores"... - ttsiodras
I have found that 'hyperthreading' can be crap for a lot of applications. I have turned it off (in the BIOS) in many cases due to applications no longer functioning or performing much worse. This isn't just Intel (seen it on POWER as well). - Marm0t

3 Answers

6
votes

Basically, you need some fairly portable way of querying the environment for fairly low-level hardware details - and generally, you can't do that from just system calls (the OS is generally unaware even of the difference between hardware threads and cores).

One library that supports a number of platforms is hwloc: it runs on Linux and Windows (and others), on Intel and AMD chips. hwloc lets you find out everything about the hardware topology, and it knows the difference between cores and hardware threads (called PUs, processing units, in hwloc terminology). So you'd call this library at startup, find the number of actual cores, and call omp_set_num_threads() (or pass that number in a num_threads clause on each parallel directive).
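As a rough illustration of the last step, here is a minimal sketch of the two ways to apply a core count once you have it. It assumes the physical core count has already been obtained elsewhere (for example from hwloc) and is passed in as `threads`; the function names `sum_last_digits` and `limit_openmp_threads` are made up for this example. The `_OPENMP` guards let the file compile serially when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sum of the last decimal digits of 0..n-1, using at most `threads`
   OpenMP threads for this one loop. `threads` is assumed to be the
   physical core count, obtained elsewhere (hypothetical input). */
unsigned long long sum_last_digits(unsigned long long n, int threads)
{
    unsigned long long k = 0;
    long long i; /* signed index: accepted by every OpenMP version */

    if (threads < 1)
        threads = 1; /* num_threads requires a positive value */

    /* Per-region limit: applies to this parallel loop only */
    #pragma omp parallel for num_threads(threads) reduction(+:k)
    for (i = 0; i < (long long)n; i++)
        k += (unsigned long long)i % 10;

    return k;
}

/* Process-wide alternative: sets the default thread count for all
   later parallel regions that carry no num_threads clause. */
void limit_openmp_threads(int threads)
{
#ifdef _OPENMP
    if (threads >= 1)
        omp_set_num_threads(threads);
#else
    (void)threads; /* built without OpenMP: nothing to configure */
#endif
}
```

Calling `limit_openmp_threads()` once at startup is usually the less intrusive option, since existing parallel directives need no changes; the per-directive clause is useful when only some loops suffer from hyperthreading.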

3
votes

Unfortunately your assumption about why this is occurring is most likely correct. To be sure, you would have to use a profile tool - but I have seen this before with raytracing, so it is not surprising. In any case, there is currently no way to determine from OpenMP that some of the processors are "real" and some are hyperthreaded. You could write some code to determine this and then set the number yourself. However, there would still be the problem that OpenMP doesn't schedule the threads on the processors itself - it allows the OS to do that.
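To make "write some code to determine this" concrete, here is a minimal, Linux-only sketch. It assumes the x86 layout of /proc/cpuinfo, where each logical processor lists a "physical id" and a "core id", and hyperthreaded siblings share the same pair. The function names are hypothetical, and this is not a portable solution; on other systems the fields may be absent, in which case the functions return 0.

```c
#include <stdio.h>

#define MAX_CORES 1024 /* arbitrary upper bound for this sketch */

/* Count distinct (physical id, core id) pairs in a stream formatted
   like Linux's /proc/cpuinfo. Hyperthreaded siblings report the same
   pair, so each physical core is counted once.
   Returns 0 if no "core id" field is found. */
int count_physical_cores(FILE *f)
{
    int seen[MAX_CORES][2];
    int nseen = 0, phys = -1, core = -1, i;
    char line[512];

    while (fgets(line, sizeof line, f)) {
        /* "physical id" precedes "core id" within each processor entry */
        if (sscanf(line, "physical id : %d", &phys) == 1)
            continue;
        if (sscanf(line, "core id : %d", &core) != 1)
            continue;

        /* Linear search: is this (package, core) pair already known? */
        for (i = 0; i < nseen; i++)
            if (seen[i][0] == phys && seen[i][1] == core)
                break;
        if (i == nseen && nseen < MAX_CORES) {
            seen[nseen][0] = phys;
            seen[nseen][1] = core;
            nseen++;
        }
    }
    return nseen;
}

/* Convenience wrapper: returns 0 when /proc/cpuinfo is unavailable. */
int detect_physical_cores(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    int n = 0;

    if (f) {
        n = count_physical_cores(f);
        fclose(f);
    }
    return n;
}
```

If `detect_physical_cores()` returns a positive number, it can be passed to `omp_set_num_threads()` before the first parallel region. This only limits how many threads are created; where they run is still up to the OS, so two threads may still end up on the same physical core.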

There has been work in the OpenMP ARB language committee to try and define a standard way for the user to determine his environment and say how to run. At this time, this discussion is still raging on. Many implementations allow you to "bind" the threads to the processors, by use of an implementation defined environment variable. However, the user has to know the processor numbering and which processors are "real" vs. hyperthreaded.

1
votes

The problem is how OMP uses HT. It's not memory bandwidth! I tried a simple loop on my 2.6GHz HT PIV. The result is amazing...

With OMP:

    $ time ./a.out 
    4500000000
    real    0m28.360s
    user    0m52.727s
    sys     0m0.064s

Without OMP:

    $ time ./a.out
    4500000000
    real    0m25.417s
    user    0m25.398s
    sys     0m0.000s

Code:

    #include <stdio.h>
    #define U64 unsigned long long
    int main() {
      U64 i;
      U64 N = 1000000000ULL; 
      U64 k = 0;
      #pragma omp parallel for reduction(+:k)
      for (i = 0; i < N; i++) 
      {
        k += i%10; // last digit
      }
      printf ("%llu\n", k);
      return 0;
    }