I'm trying to get to the bottom of some rather disappointing performance results we've been getting for our HPC applications. I wrote the following benchmark in Visual Studio 2010 that distills the essence of our applications (lots of independent, high-arithmetic-intensity operations):
#include "stdafx.h"
#include <math.h>
#include <time.h>
#include <Windows.h>
#include <stdio.h>
#include <memory.h>
#include <process.h>
void makework(void *jnk) {
    double tmp = 0;
    // 10,000 x 1,000,000 iterations: one multiply and one dependent add each
    for(int j=0; j<10000; j++) {
        for(int i=0; i<1000000; i++) {
            tmp = tmp+(double)i*(double)i;
        }
    }
    *((double *)jnk) = tmp;   // store the result so the work isn't optimized away
    _endthread();
}

void spawnthreads(int num) {
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
    }
    // time from the point all threads have been launched until the last one finishes
    DWORD start = GetTickCount();
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    DWORD end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f\n", num, (double)(end-start)/1000.0);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}

int _tmain(int argc, _TCHAR* argv[])
{
    for(int i=1; i<=20; i++) {
        spawnthreads(i);
    }
    return 0;
}
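(A side note on the timing: GetTickCount has roughly 10-16 ms granularity, which is lost in the noise over ~11-second runs, but if I ever shrink the per-thread work, a QueryPerformanceCounter version of the timing block would look roughly like this sketch:)

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);        // counter ticks per second
    QueryPerformanceCounter(&t0);
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    QueryPerformanceCounter(&t1);
    double elapsed = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    printf("Elapsed time: %.6f seconds\n", elapsed);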
I'm doing the exact same operation in each thread, so the elapsed time should (ideally) stay constant at ~11 seconds until I've filled up the physical cores, and then maybe double when I start using logical (hyperthreaded) cores. There shouldn't be any cache concerns, since the loop variables and the results fit in registers.
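One way to check that assumption is to look at what the compiler actually emits for the inner loop. Building the translation unit with an assembly listing enabled, e.g. something along these lines (standard cl.exe flags; the exact VS2010 project settings differ), produces a .asm file next to the source:

    cl /O2 /FAs makework.cpp

In the IDE the equivalent is C/C++ -> Output Files -> Assembler Output set to "Assembly With Source Code (/FAs)".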
Here are the results of my experiment on two testbeds, both running Windows Server 2008.
Machine 1: Dual Xeon X5690 @ 3.47 GHz -- 12 physical cores, 24 logical cores, Westmere architecture
Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds
Machine 2: Dual Xeon E5-2690 @ 2.90 GHz -- 16 physical cores, 32 logical cores, Sandy Bridge architecture
Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds
Here are the aspects I find puzzling:
Why does the elapsed time on the Westmere machine stay roughly constant until about 6 threads, then jump suddenly, and then stay basically constant above 10 threads? Is Windows stuffing all the threads onto a single processor before moving on to the second one, so that hyperthreading kicks in nondeterministically once the first processor is filled? (A pinning experiment to test this is sketched below.)
Why does the time elapsed with the Sandy Bridge machine increase basically linearly with the number of threads until about 12? Twelve doesn't seem like a meaningful number to me considering the number of cores.
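One experiment I plan to run to test the scheduling hypothesis in the first question is to pin each thread to its own logical processor and see whether the knee at 6-7 threads moves. A minimal sketch, reusing the launch loop from spawnthreads() above; with at most 20 threads the mask comfortably fits in a DWORD_PTR:

    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
        // Pin thread i to logical processor i so the scheduler cannot pack the
        // threads onto one socket or migrate them mid-run. (The thread may run
        // briefly on another CPU before this call takes effect.)
        SetThreadAffinityMask(hThreads[i], (DWORD_PTR)1 << i);
    }

Whether logical processor numbers map to the two sockets contiguously or interleaved varies by system, but at least the placement becomes deterministic and visible.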
Any thoughts are appreciated, as are suggestions on processor counters to measure or ways to improve my benchmark. Is this an architecture problem or a Windows problem?
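On the counter side, one diagnostic I plan to add is per-thread CPU time via GetThreadTimes, compared against the wall-clock time: if each thread's user time stays close to the wall time while both grow, the cores are genuinely executing the work more slowly (e.g. reduced Turbo headroom as more cores become active); if user time stays near ~11 s while wall time grows, the threads are being time-sliced. A sketch of what I'd append to spawnthreads() after the wait; it assumes switching to _beginthreadex so the handles remain valid until I close them (plain _beginthread closes its handle automatically when the thread exits):

    // After WaitForMultipleObjects returns, report the CPU time each worker
    // was actually charged. FILETIME values are in 100 ns units.
    for(int i=0; i<num; i++) {
        FILETIME ftCreate, ftExit, ftKernel, ftUser;
        if(GetThreadTimes(hThreads[i], &ftCreate, &ftExit, &ftKernel, &ftUser)) {
            ULONGLONG user100ns = ((ULONGLONG)ftUser.dwHighDateTime << 32) | ftUser.dwLowDateTime;
            printf("thread %2d: %.3f s of user CPU time\n", i, (double)user100ns / 1.0e7);
        }
        CloseHandle(hThreads[i]);   // required when the threads come from _beginthreadex
    }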
Edit:
As suggested below, the compiler was doing some strange things, so I wrote my own assembly code that does the same thing as above but keeps all FP values on the FP stack to avoid any memory accesses:
void makework(void *jnk) {
    register int i, j;
    // register double tmp = 0;
    __asm {
        fldz                    // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz                // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)           // stack: i, i, res
                fld st(0)           // stack: i, i, i, res
                fmul                // stack: i*i, i, res
                faddp st(2), st(0)  // stack: i, res+i*i
                fld1                // stack: 1, i, res+i*i
                fadd                // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)          // pop i off the stack, leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
    // *((double *)jnk) = tmp;
    _endthread();
}
This assembles as:
013E1002 in al,dx
013E1003 fldz
013E1005 mov ecx,2710h
013E100A lea ebx,[ebx]
013E1010 fldz
013E1012 mov eax,0F4240h
013E1017 fld st(0)
013E1019 fld st(0)
013E101B fmulp st(1),st
013E101D faddp st(2),st
013E101F fld1
013E1021 faddp st(1),st
013E1023 dec eax
013E1024 jne makework+17h (13E1017h)
013E1026 fstp st(0)
013E1028 dec ecx
013E1029 jne makework+10h (13E1010h)
013E102B mov eax,dword ptr [jnk]
013E102E fstp qword ptr [eax]
013E1030 pop ebp
013E1031 jmp dword ptr [__imp___endthread (13E20C0h)]
The results for machine 1 above are:
Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds
So it's about 9% slower with one thread (the difference between inc eax versus fld1 and faddp, perhaps?), and when all the physical cores are filled it's almost twice as slow (which would be expected from hyperthreading). But the puzzling degradation starting at only 6 threads still remains...
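To pin down whether that knee is a placement effect, the next diagnostic I'll add is having each thread log which logical processor it is actually running on (GetCurrentProcessorNumber is available on Server 2008). A sketch using the plain C version of makework, since the placement question is independent of how the FP work is coded:

void makework(void *jnk) {
    double tmp = 0;
    for(int j=0; j<10000; j++) {
        // Sample which logical processor this thread is on a few times per run;
        // if Windows is packing threads onto one socket (or migrating them),
        // it should show up here. Sampled sparsely so the printf itself
        // doesn't perturb the timing much.
        if(j % 2000 == 0)
            printf("thread %lu on CPU %lu\n",
                   GetCurrentThreadId(), GetCurrentProcessorNumber());
        for(int i=0; i<1000000; i++) {
            tmp = tmp+(double)i*(double)i;
        }
    }
    *((double *)jnk) = tmp;
    _endthread();
}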
Comments:
CreateThread was working fine for decades; _beginthread is a relatively new thing.... Who knows. – Kirill Kobelev
_beginthread has been part of the MSVC runtime for nearly twenty years. And for a very long time after release, if your program used any functionality of the runtime library it wasn't even optional; it was mandatory. It is the responsible party for setting up all CRT-based TLS-held data. It surely isn't new, relatively or otherwise. And it should have no effect on a thread's performance beyond startup costs. – WhozCraig