1 vote

I'm trying to get to the bottom of some rather disappointing performance results we've been getting for our HPC applications. I wrote the following benchmark in Visual Studio 2010 that distills the essence of our applications (lots of independent, high arithmetic intensity operations):

#include "stdafx.h"
#include <math.h>
#include <time.h>
#include <Windows.h>
#include <stdio.h>
#include <memory.h>
#include <process.h>

// Each thread performs the same fixed amount of floating-point work and
// writes out a single double at the end.
void makework(void *jnk) {
    double tmp = 0;
    for(int j=0; j<10000; j++) {
        for(int i=0; i<1000000; i++) {
            tmp = tmp+(double)i*(double)i;
        }
    }
    *((double *)jnk) = tmp;
    _endthread();
}

// Launch num worker threads, wait for all of them, and log the wall-clock time.
void spawnthreads(int num) {
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f\n", num, (double)(end-start)/1000.0);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

int _tmain(int argc, _TCHAR* argv[])
{
    for(int i=1; i<=20; i++) {
        spawnthreads(i);
    }
    return 0;
}

I'm doing the exact same operation in each thread, so it should (ideally) take a constant ~11 seconds until I've filled up the physical cores, and then maybe double when I start using logical hyperthreaded cores. There shouldn't be any cache concerns since my loop variables and the results can fit into registers.

Here are the results of my experiment on two testbeds, both running Windows Server 2008.

Machine 1: Dual Xeon X5690 @ 3.47 GHz -- 12 physical cores, 24 logical cores, Westmere architecture

Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds

Machine 2: Dual Xeon E5-2690 @ 2.90 GHz -- 16 physical cores, 32 logical cores, Sandy Bridge architecture

Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds

Here are the aspects I find puzzling:

  • Why does the elapsed time on the Westmere machine stay constant until about 6 threads, then jump suddenly, and then stay basically constant above 10 threads? Is Windows stuffing all the threads into a single processor before moving on to the second one, so that hyperthreading kicks in nondeterministically after one processor is filled?

  • Why does the elapsed time on the Sandy Bridge machine increase roughly linearly with the number of threads until about 12? Twelve doesn't seem like a meaningful number to me, considering the number of cores.

Any thoughts are appreciated, as are suggestions for processor counters to measure or ways to improve my benchmark. Is this an architecture problem or a Windows problem?
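
(For what it's worth, one easy measurement tweak, independent of the scaling question: GetTickCount only ticks every 10-16 ms, so QueryPerformanceCounter would give finer-grained timings. A minimal sketch of the substitution; elapsed_seconds is a hypothetical helper, not part of the benchmark above, and it only uses the <Windows.h> already included at the top:)

// Hypothetical helper: turn a pair of QueryPerformanceCounter readings into seconds.
double elapsed_seconds(LARGE_INTEGER start, LARGE_INTEGER end) {
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // counter ticks per second
    return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}

// In spawnthreads(), the GetTickCount() pair would become:
//     LARGE_INTEGER t0, t1;
//     QueryPerformanceCounter(&t0);
//     WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
//     QueryPerformanceCounter(&t1);
//     printf("Elapsed time: %.3f seconds\n", elapsed_seconds(t0, t1));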

Edit:

As suggested below, the compiler was doing some strange things, so I wrote my own assembly code that does the same thing as above but leaves all FP operations on the FP stack to avoid any memory accesses:

void makework(void *jnk) {
    register int i, j;
//  register double tmp = 0;
    __asm {
        fldz  // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)  // stack: i, i, res
                fld st(0)  // stack: i, i, i, res
                fmul       // stack: i*i, i, res
                faddp st(2), st(0) // stack: i, res+i*i
                fld1       // stack: 1, i, res+i*i
                fadd      // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)   // pop i off the stack leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
//  *((double *)jnk) = tmp;
    _endthread();
}

This assembles as:

013E1002  in          al,dx  
013E1003  fldz  
013E1005  mov         ecx,2710h  
013E100A  lea         ebx,[ebx]  
013E1010  fldz  
013E1012  mov         eax,0F4240h  
013E1017  fld         st(0)  
013E1019  fld         st(0)  
013E101B  fmulp       st(1),st  
013E101D  faddp       st(2),st  
013E101F  fld1  
013E1021  faddp       st(1),st  
013E1023  dec         eax  
013E1024  jne         makework+17h (13E1017h)  
013E1026  fstp        st(0)  
013E1028  dec         ecx  
013E1029  jne         makework+10h (13E1010h)  
013E102B  mov         eax,dword ptr [jnk]  
013E102E  fstp        qword ptr [eax]  
013E1030  pop         ebp  
013E1031  jmp         dword ptr [__imp___endthread (13E20C0h)]  

The results for machine 1 above are:

Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds

So it's about 9% slower with one thread (the difference between inc eax vs. fld1 and faddp, perhaps?), and when all the physical cores are filled it's almost twice as slow (which would be expected from hyperthreading). But the puzzling degradation starting at only 6 threads still remains...

Very (very) random idea: the scheduler is shuffling the threads around the cores needlessly. Do the results look the same if you manually set the CPU affinity for each thread? (us2012)
Also, are you running your threads with real-time priority? (jxh)
Well, it is not a scheduler problem; you'd expect a signal again at 12 and 18 threads. The Xeon really does have only 6 cores. I'd guess at a NUMA setup with two chips, and you're seeing the interconnect overhead. (Hans Passant)
I would recommend switching to CreateThread, which has worked fine for decades. _beginthread is a relatively new thing... who knows. (Kirill Kobelev)
@KirillKobelev Relative to what?? _beginthread has been part of the MSVC runtime for nearly twenty years. And for a very long time after release, if your program used any functionality of the runtime library it wasn't even optional; it was mandatory. It is the responsible party for setting up all CRT-based TLS-held data. It surely isn't new, relatively or otherwise. And it should have no effect on a thread's performance beyond its startup cost. (WhozCraig)

4 Answers

2 votes

Now, to be totally lame and answer my own question -- it seems to be the scheduler, as @us2012 suggested in the comments. I hardcoded the affinity masks to fill up the physical cores first and then move to the hyperthreaded cores:

void spawnthreads(int num) {
    ULONG_PTR masks[] = {  // for my system; YMMV
        0x1, 0x4, 0x10, 0x40, 0x100, 0x400, 0x1000, 0x4000, 0x10000, 0x40000, 
        0x100000, 0x400000, 0x2, 0x8, 0x20, 0x80, 0x200, 0x800, 0x2000, 0x8000};
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
        SetThreadAffinityMask(hThreads[i], masks[i]);
    }
    int start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    int end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f,%f\n", num, (double)(end-start)/1000.0, junk[0]);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

and get

Starting 1 threads... Elapsed time: 12.558 seconds
Starting 2 threads... Elapsed time: 12.558 seconds
Starting 3 threads... Elapsed time: 12.589 seconds
Starting 4 threads... Elapsed time: 12.652 seconds
Starting 5 threads... Elapsed time: 12.621 seconds
Starting 6 threads... Elapsed time: 12.777 seconds
Starting 7 threads... Elapsed time: 12.636 seconds
Starting 8 threads... Elapsed time: 12.886 seconds
Starting 9 threads... Elapsed time: 13.057 seconds
Starting 10 threads... Elapsed time: 12.714 seconds
Starting 11 threads... Elapsed time: 12.777 seconds
Starting 12 threads... Elapsed time: 12.668 seconds
Starting 13 threads... Elapsed time: 26.489 seconds
Starting 14 threads... Elapsed time: 26.505 seconds
Starting 15 threads... Elapsed time: 26.505 seconds
Starting 16 threads... Elapsed time: 26.489 seconds
Starting 17 threads... Elapsed time: 26.489 seconds
Starting 18 threads... Elapsed time: 26.676 seconds
Starting 19 threads... Elapsed time: 26.770 seconds
Starting 20 threads... Elapsed time: 26.489 seconds

which is as expected. Now the question is: what OS settings can I twiddle to make this closer to the default behavior, since most of our code is written in MATLAB...
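
For reference, the mask table above is hardcoded for this particular box; the same "physical cores first, then their hyperthreaded siblings" ordering could also be built at runtime from the core topology. A minimal sketch using GetLogicalProcessorInformation -- buildmasks() is a hypothetical helper, error handling is omitted, and it assumes at most 64 logical processors (a single processor group):

int buildmasks(ULONG_PTR *masks, int maxmasks) {
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);          // ask for the required buffer size
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info =
        (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *)malloc(len);
    GetLogicalProcessorInformation(info, &len);
    DWORD count = len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);

    int n = 0;
    // Pass 1: one logical processor (the lowest set bit) per physical core.
    for (DWORD i = 0; i < count && n < maxmasks; i++) {
        if (info[i].Relationship == RelationProcessorCore)
            masks[n++] = info[i].ProcessorMask & (~info[i].ProcessorMask + 1);
    }
    // Pass 2: the remaining (hyperthreaded sibling) logical processors of each core.
    for (DWORD i = 0; i < count && n < maxmasks; i++) {
        if (info[i].Relationship == RelationProcessorCore) {
            ULONG_PTR rest = info[i].ProcessorMask;
            rest &= rest - 1;                            // drop the bit already used in pass 1
            while (rest && n < maxmasks) {
                masks[n++] = rest & (~rest + 1);         // take the lowest remaining bit
                rest &= rest - 1;
            }
        }
    }
    free(info);
    return n;   // number of masks produced
}

spawnthreads() could then call buildmasks() once instead of using the literal table; the order within each socket depends on how Windows enumerates the cores, so it may not match the hand-written table bit for bit.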

1 vote

(Possible explanation) Have you checked for background activity on these machines? It may be that the OS cannot fully dedicate all of its cores to you. On Machine 1, considerable growth starts once you occupy more than half of the cores; your threads may be competing for resources with something else.

You may also want to check for restrictions and domain policies on your computer/account that prevent a process from grabbing all available resources.
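
A quick way to check the account/process side of this is to ask Windows how many logical processors the benchmark process is actually allowed to use. A minimal standalone sketch (single processor group assumed):

#include <Windows.h>
#include <stdio.h>

// If the system mask has more set bits than the process mask, something
// (a job object, policy, or the launching shell) has restricted the process.
int main(void) {
    DWORD_PTR procMask = 0, sysMask = 0;
    if (GetProcessAffinityMask(GetCurrentProcess(), &procMask, &sysMask)) {
        int procCount = 0, sysCount = 0;
        for (DWORD_PTR m = procMask; m; m &= m - 1) procCount++;   // popcount
        for (DWORD_PTR m = sysMask;  m; m &= m - 1) sysCount++;
        printf("process may run on %d of %d logical processors\n",
               procCount, sysCount);
    }
    return 0;
}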

1 vote

On my laptop with 2 physical cores and 4 logical cores, I get:

Starting 1 threads... Elapsed time: 11.638 seconds
Starting 2 threads... Elapsed time: 12.418 seconds
Starting 3 threads... Elapsed time: 13.556 seconds
Starting 4 threads... Elapsed time: 14.929 seconds
Starting 5 threads... Elapsed time: 20.811 seconds
Starting 6 threads... Elapsed time: 22.776 seconds
Starting 7 threads... Elapsed time: 27.160 seconds
Starting 8 threads... Elapsed time: 30.249 seconds

This shows degradation as soon as there is more than one thread.

I suspect the reason is that makework() is doing memory accesses. You can see this in Visual Studio 2010 by setting a breakpoint on the first line of _tmain(). When you hit the breakpoint, press Ctrl-Alt-D to open the disassembly window. Anywhere you see a register name in brackets (e.g. [esp]), there is a memory access. The bandwidth of the on-CPU level 1 cache is saturating. You can test this theory with a modified makework():

void makework(void *jnk) {
    double tmp = 0;
    volatile double *p;
    int i;
    int j;
    p = (double *)jnk;

    for(j=0; j<100000000; j++) {
        for(i=0; i<100; i++) {
            tmp = tmp+(double)i*(double)i;
        }
        *p = tmp;   // extra memory write every 100 inner iterations
    }
    *p = tmp;
    _endthread();
}

It does the same number of computations, but with an extra memory write thrown in every 100 iterations. On my laptop, the results are:

Starting 1 threads... Elapsed time: 11.684 seconds
Starting 2 threads... Elapsed time: 13.760 seconds
Starting 3 threads... Elapsed time: 14.445 seconds
Starting 4 threads... Elapsed time: 17.519 seconds
Starting 5 threads... Elapsed time: 23.369 seconds
Starting 6 threads... Elapsed time: 25.491 seconds
Starting 7 threads... Elapsed time: 30.155 seconds
Starting 8 threads... Elapsed time: 34.460 seconds

This shows the impact memory access can have on the results. I tried various VS2010 compiler settings to see whether I could get makework() to have no memory accesses, but had no luck. To truly study raw CPU core performance versus the number of active threads, I suspect we'd have to code makework() in assembler.

0 votes

OK, now that we've ruled out the memory saturation theory (although - x87? Ouch, don't expect much performance there. Try switching to SSE/AVX if you can live with what they provide; a sketch is at the end of this answer), the core scaling should still make sense. Let's look at the CPU models you used:

Can you validate that these are the correct models?

Intel® Xeon® Processor X5690 (12M Cache, 3.46 GHz, 6.40 GT/s Intel® QPI)

http://ark.intel.com/products/52576

Intel® Xeon® Processor E5-2690 (20M Cache, 2.90 GHz, 8.00 GT/s Intel® QPI)

http://ark.intel.com/products/64596/

If so, then the first indeed has 6 physical cores (12 logical) per socket, and the second has 8 physical cores (16 logical) per socket. Come to think of it, I don't think you could get a higher core count on a single socket in these generations, so it makes sense, and it fits your numbers perfectly.

Edit: On a multi-socket system, the OS may prefer to keep filling a single socket while logical cores are still available on it. It might depend on the exact version, but for Windows Server 2008 there's an interesting comment here: http://blogs.technet.com/b/matthts/archive/2012/10/14/windows-server-sockets-logical-processors-symmetric-multi-threading.aspx

quoting:

When the OS boots it starts with socket 1 and enumerates all logical processors:

    on socket 1 it enumerates logical processors 1-20
    on socket 2 it enumerates logical processors 21-40
    on socket 3 it enumerates logical processors 41-60
    on socket 4 it would see 61-64

If this is the order in which your OS wakes up threads, perhaps SMT kicks in before work spills over to the second socket.
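
As a footnote to the SSE/AVX remark near the top of this answer, here is a minimal sketch of what makework() might look like with SSE2 scalar intrinsics instead of the x87 stack. makework_sse is a hypothetical substitute, untested on the machines above; with a single serial dependency chain on the accumulator it won't necessarily run faster, it just moves the work off x87:

#include <emmintrin.h>   // SSE2 intrinsics
#include <process.h>

void makework_sse(void *jnk) {
    __m128d acc = _mm_setzero_pd();                    // running sum, kept in an XMM register
    const __m128d one = _mm_set_sd(1.0);
    for (int j = 0; j < 10000; j++) {
        __m128d x = _mm_setzero_pd();                  // double-precision copy of the inner counter
        for (int i = 0; i < 1000000; i++) {
            acc = _mm_add_sd(acc, _mm_mul_sd(x, x));   // acc += x * x
            x = _mm_add_sd(x, one);                    // x += 1.0
        }
    }
    _mm_store_sd((double *)jnk, acc);                  // single store at the very end
    _endthread();
}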