1 vote

I'm trying to get to the bottom of some rather disappointing performance results we've been getting for our HPC applications. I wrote the following benchmark in Visual Studio 2010 that distills the essence of our applications (lots of independent, high arithmetic intensity operations):

#include "stdafx.h"
#include <math.h>
#include <time.h>
#include <Windows.h>
#include <stdio.h>
#include <memory.h>
#include <process.h>

// Each thread performs the same fixed amount of floating-point work and
// writes out a single double at the end.
void makework(void *jnk) {
    double tmp = 0;
    for(int j=0; j<10000; j++) {
        for(int i=0; i<1000000; i++) {
            tmp = tmp+(double)i*(double)i;
        }
    }
    *((double *)jnk) = tmp;
    _endthread();
}

// Launch num worker threads, wait for all of them, and log the wall-clock time.
void spawnthreads(int num) {
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f\n", num, (double)(end-start)/1000.0);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

int _tmain(int argc, _TCHAR* argv[])
{
    for(int i=1; i<=20; i++) {
        spawnthreads(i);
    }
    return 0;
}

I'm doing the exact same operation in each thread, so it should (ideally) take a constant ~11 seconds until I've filled up the physical cores, and then maybe double when I start using logical hyperthreaded cores. There shouldn't be any cache concerns since my loop variables and the results can fit into registers.

Here are the results of my experiment on two testbeds, both running Windows Server 2008.

Machine 1: Dual Xeon X5690 @ 3.47 GHz -- 12 physical cores, 24 logical cores, Westmere architecture

Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds

Machine 2: Dual Xeon E5-2690 @ 2.90 GHz -- 16 physical cores, 32 logical cores, Sandy Bridge architecture

Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds

Here are the aspects I find puzzling:

  • Why does the elapsed time on the Westmere machine stay constant until about 6 threads, then jump suddenly, and then stay basically constant above 10 threads? Is Windows stuffing all the threads into a single processor before moving on to the second one, so that hyperthreading kicks in nondeterministically after one processor is filled?

  • Why does the elapsed time on the Sandy Bridge machine increase roughly linearly with the number of threads until about 12? Twelve doesn't seem like a meaningful number to me, considering the number of cores.

Any thoughts are appreciated, as are suggestions for processor counters to measure or ways to improve my benchmark. Is this an architecture problem or a Windows problem?
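
(For what it's worth, one easy measurement tweak, independent of the scaling question: GetTickCount only ticks every 10-16 ms, so QueryPerformanceCounter would give finer-grained timings. A minimal sketch of the substitution; elapsed_seconds is a hypothetical helper, not part of the benchmark above, and it only uses the <Windows.h> already included at the top:)

// Hypothetical helper: turn a pair of QueryPerformanceCounter readings into seconds.
double elapsed_seconds(LARGE_INTEGER start, LARGE_INTEGER end) {
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // counter ticks per second
    return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}

// In spawnthreads(), the GetTickCount() pair would become:
//     LARGE_INTEGER t0, t1;
//     QueryPerformanceCounter(&t0);
//     WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
//     QueryPerformanceCounter(&t1);
//     printf("Elapsed time: %.3f seconds\n", elapsed_seconds(t0, t1));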

Edit:

As suggested below, the compiler was doing some strange things, so I wrote my own assembly code that does the same thing as above but leaves all FP operations on the FP stack to avoid any memory accesses:

void makework(void *jnk) {
    register int i, j;
//  register double tmp = 0;
    __asm {
        fldz  // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)  // stack: i, i, res
                fld st(0)  // stack: i, i, i, res
                fmul       // stack: i*i, i, res
                faddp st(2), st(0) // stack: i, res+i*i
                fld1       // stack: 1, i, res+i*i
                fadd      // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)   // pop i off the stack leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
//  *((double *)jnk) = tmp;
    _endthread();
}

This assembles as:

013E1002  in          al,dx  
013E1003  fldz  
013E1005  mov         ecx,2710h  
013E100A  lea         ebx,[ebx]  
013E1010  fldz  
013E1012  mov         eax,0F4240h  
013E1017  fld         st(0)  
013E1019  fld         st(0)  
013E101B  fmulp       st(1),st  
013E101D  faddp       st(2),st  
013E101F  fld1  
013E1021  faddp       st(1),st  
013E1023  dec         eax  
013E1024  jne         makework+17h (13E1017h)  
013E1026  fstp        st(0)  
013E1028  dec         ecx  
013E1029  jne         makework+10h (13E1010h)  
013E102B  mov         eax,dword ptr [jnk]  
013E102E  fstp        qword ptr [eax]  
013E1030  pop         ebp  
013E1031  jmp         dword ptr [__imp___endthread (13E20C0h)]  

The results for machine 1 above are:

Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds

So it's about 9% slower with one thread (the difference between inc eax vs. fld1 and faddp, perhaps?), and when all the physical cores are filled it's almost twice as slow (which would be expected from hyperthreading). But the puzzling degradation starting at only 6 threads still remains...

Very (very) random idea: the scheduler is shuffling the threads around the cores needlessly. Do the results look the same if you manually set the CPU affinity for each thread? (us2012)
Also, are you running your threads with real-time priority? (jxh)
Well, it is not a scheduler problem; you'd expect a signal again at 12 and 18 threads. The Xeon really does have only 6 cores. I'd guess at a NUMA setup with two chips, and you're seeing the interconnect overhead. (Hans Passant)
I would recommend switching to CreateThread, which has worked fine for decades. _beginthread is a relatively new thing... who knows. (Kirill Kobelev)
@KirillKobelev Relative to what?? _beginthread has been part of the MSVC runtime for nearly twenty years. And for a very long time after release, if your program used any functionality of the runtime library it wasn't even optional; it was mandatory. It is the responsible party for setting up all CRT-based TLS-held data. It surely isn't new, relatively or otherwise. And it should have no effect on a thread's performance beyond its startup cost. (WhozCraig)

4 Answers

2 votes

Now, to be totally lame and answer my own question -- it seems to be the scheduler, as @us2012 suggested in the comments. I hardcoded the affinity masks to fill up the physical cores first and then move to the hyperthreaded cores:

void spawnthreads(int num) {
    ULONG_PTR masks[] = {  // for my system; YMMV
        0x1, 0x4, 0x10, 0x40, 0x100, 0x400, 0x1000, 0x4000, 0x10000, 0x40000, 
        0x100000, 0x400000, 0x2, 0x8, 0x20, 0x80, 0x200, 0x800, 0x2000, 0x8000};
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
        SetThreadAffinityMask(hThreads[i], masks[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f,%f\n", num, (double)(end-start)/1000.0, junk[0]);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

and get

Starting 1 threads... Elapsed time: 12.558 seconds
Starting 2 threads... Elapsed time: 12.558 seconds
Starting 3 threads... Elapsed time: 12.589 seconds
Starting 4 threads... Elapsed time: 12.652 seconds
Starting 5 threads... Elapsed time: 12.621 seconds
Starting 6 threads... Elapsed time: 12.777 seconds
Starting 7 threads... Elapsed time: 12.636 seconds
Starting 8 threads... Elapsed time: 12.886 seconds
Starting 9 threads... Elapsed time: 13.057 seconds
Starting 10 threads... Elapsed time: 12.714 seconds
Starting 11 threads... Elapsed time: 12.777 seconds
Starting 12 threads... Elapsed time: 12.668 seconds
Starting 13 threads... Elapsed time: 26.489 seconds
Starting 14 threads... Elapsed time: 26.505 seconds
Starting 15 threads... Elapsed time: 26.505 seconds
Starting 16 threads... Elapsed time: 26.489 seconds
Starting 17 threads... Elapsed time: 26.489 seconds
Starting 18 threads... Elapsed time: 26.676 seconds
Starting 19 threads... Elapsed time: 26.770 seconds
Starting 20 threads... Elapsed time: 26.489 seconds

which is as expected. Now the question is: what OS settings can I twiddle to make this closer to the default behavior, since most of our code is written in MATLAB...
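
For reference, the mask table above is hardcoded for this particular box; the same "physical cores first, then their hyperthreaded siblings" ordering could also be built at runtime from the core topology. A minimal sketch using GetLogicalProcessorInformation -- buildmasks() is a hypothetical helper, error handling is omitted, and it assumes at most 64 logical processors (a single processor group):

int buildmasks(ULONG_PTR *masks, int maxmasks) {
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);          // ask for the required buffer size
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info =
        (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *)malloc(len);
    GetLogicalProcessorInformation(info, &len);
    DWORD count = len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);

    int n = 0;
    // Pass 1: one logical processor (the lowest set bit) per physical core.
    for (DWORD i = 0; i < count && n < maxmasks; i++) {
        if (info[i].Relationship == RelationProcessorCore)
            masks[n++] = info[i].ProcessorMask & (~info[i].ProcessorMask + 1);
    }
    // Pass 2: the remaining (hyperthreaded sibling) logical processors of each core.
    for (DWORD i = 0; i < count && n < maxmasks; i++) {
        if (info[i].Relationship == RelationProcessorCore) {
            ULONG_PTR rest = info[i].ProcessorMask;
            rest &= rest - 1;                            // drop the bit already used in pass 1
            while (rest && n < maxmasks) {
                masks[n++] = rest & (~rest + 1);         // take the lowest remaining bit
                rest &= rest - 1;
            }
        }
    }
    free(info);
    return n;   // number of masks produced
}

spawnthreads() could then call buildmasks() once instead of using the literal table; the order within each socket depends on how Windows enumerates the cores, so it may not match the hand-written table bit for bit.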

1 vote

(Possible explanation) Have you checked for background activity on these machines? It may be that the OS cannot fully dedicate all of its cores to you. On Machine 1, considerable growth starts once you occupy more than half of the cores; your threads may be competing for resources with something else.

You may also want to check for restrictions and domain policies on your computer/account that prevent a process from grabbing all available resources.
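
A quick way to check the account/process side of this is to ask Windows how many logical processors the benchmark process is actually allowed to use. A minimal standalone sketch (single processor group assumed):

#include <Windows.h>
#include <stdio.h>

// If the system mask has more set bits than the process mask, something
// (a job object, policy, or the launching shell) has restricted the process.
int main(void) {
    DWORD_PTR procMask = 0, sysMask = 0;
    if (GetProcessAffinityMask(GetCurrentProcess(), &procMask, &sysMask)) {
        int procCount = 0, sysCount = 0;
        for (DWORD_PTR m = procMask; m; m &= m - 1) procCount++;   // popcount
        for (DWORD_PTR m = sysMask;  m; m &= m - 1) sysCount++;
        printf("process may run on %d of %d logical processors\n",
               procCount, sysCount);
    }
    return 0;
}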

1 vote

On my laptop with 2 physical cores and 4 logical cores, I get:

Starting 1 threads... Elapsed time: 11.638 seconds
Starting 2 threads... Elapsed time: 12.418 seconds
Starting 3 threads... Elapsed time: 13.556 seconds
Starting 4 threads... Elapsed time: 14.929 seconds
Starting 5 threads... Elapsed time: 20.811 seconds
Starting 6 threads... Elapsed time: 22.776 seconds
Starting 7 threads... Elapsed time: 27.160 seconds
Starting 8 threads... Elapsed time: 30.249 seconds

This shows degradation as soon as there is more than one thread.

I suspect the reason is that makework() is doing memory accesses. You can see this in Visual Studio 2010 by setting a breakpoint on the first line of _tmain(). When you hit the breakpoint, press Ctrl-Alt-D to open the disassembly window. Anywhere you see a register name in brackets (e.g. [esp]), there is a memory access. The bandwidth of the on-CPU level 1 cache is saturating. You can test this theory with a modified makework():

void makework(void *jnk) {
    double tmp = 0;
    volatile double *p;
    int i;
    int j;
    p = (double *)jnk;

    for(j=0; j<100000000; j++) {
        for(i=0; i<100; i++) {
            tmp = tmp+(double)i*(double)i;
        }
        *p = tmp;   // extra memory write every 100 inner iterations
    }
    *p = tmp;
    _endthread();
}

It does the same number of computations, but with an extra memory write thrown in every 100 iterations. On my laptop, the results are:

Starting 1 threads... Elapsed time: 11.684 seconds
Starting 2 threads... Elapsed time: 13.760 seconds
Starting 3 threads... Elapsed time: 14.445 seconds
Starting 4 threads... Elapsed time: 17.519 seconds
Starting 5 threads... Elapsed time: 23.369 seconds
Starting 6 threads... Elapsed time: 25.491 seconds
Starting 7 threads... Elapsed time: 30.155 seconds
Starting 8 threads... Elapsed time: 34.460 seconds

This shows the impact memory access can have on the results. I tried various VS2010 compiler settings to see whether I could get makework() to have no memory accesses, but had no luck. To truly study raw CPU core performance versus the number of active threads, I suspect we'd have to code makework() in assembler.

0 votes

OK, now that we've ruled out the memory saturation theory (although - x87? Ouch, don't expect much performance there. Try switching to SSE/AVX if you can live with what they provide; a sketch is at the end of this answer), the core scaling should still make sense. Let's look at the CPU models you used:

Can you validate that these are the correct models?

Intel® Xeon® Processor X5690 (12M Cache, 3.46 GHz, 6.40 GT/s Intel® QPI)

http://ark.intel.com/products/52576

Intel® Xeon® Processor E5-2690 (20M Cache, 2.90 GHz, 8.00 GT/s Intel® QPI)

http://ark.intel.com/products/64596/

If so, then the first indeed has 6 physical cores (12 logical) per socket, and the second has 8 physical cores (16 logical) per socket. Come to think of it, I don't think you could get a higher core count on a single socket in these generations, so it makes sense, and it fits your numbers perfectly.

Edit: On a multi-socket system, the OS may prefer to keep filling a single socket while logical cores are still available on it. It might depend on the exact version, but for Windows Server 2008 there's an interesting comment here: http://blogs.technet.com/b/matthts/archive/2012/10/14/windows-server-sockets-logical-processors-symmetric-multi-threading.aspx

quoting:

When the OS boots it starts with socket 1 and enumerates all logical processors:

    on socket 1 it enumerates logical processors 1-20
    on socket 2 it enumerates logical processors 21-40
    on socket 3 it enumerates logical processors 41-60
    on socket 4 it would see 61-64

If this is the order in which your OS wakes up threads, perhaps SMT kicks in before work spills over to the second socket.
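
As a footnote to the SSE/AVX remark near the top of this answer, here is a minimal sketch of what makework() might look like with SSE2 scalar intrinsics instead of the x87 stack. makework_sse is a hypothetical substitute, untested on the machines above; with a single serial dependency chain on the accumulator it won't necessarily run faster, it just moves the work off x87:

#include <emmintrin.h>   // SSE2 intrinsics
#include <process.h>

void makework_sse(void *jnk) {
    __m128d acc = _mm_setzero_pd();                    // running sum, kept in an XMM register
    const __m128d one = _mm_set_sd(1.0);
    for (int j = 0; j < 10000; j++) {
        __m128d x = _mm_setzero_pd();                  // double-precision copy of the inner counter
        for (int i = 0; i < 1000000; i++) {
            acc = _mm_add_sd(acc, _mm_mul_sd(x, x));   // acc += x * x
            x = _mm_add_sd(x, one);                    // x += 1.0
        }
    }
    _mm_store_sd((double *)jnk, acc);                  // single store at the very end
    _endthread();
}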