I have a program that I run on an Intel Xeon Phi Knights Landing (KNL) 7210 processor (64 cores; it is a standalone PC, in native mode), compiled with the Intel C++ compiler (icpc) version 17.0.4. I also run the same code on an Intel Core i7 processor, where the icpc version is 17.0.1. To be precise, I compile the code on the machine I run it on (compiled on the i7 and run on the i7; the same for KNL). I never build the binary on one machine and move it to another. The loops are parallelized and vectorized using OpenMP. For best performance I use the following Intel compiler flags:
-DCMAKE_CXX_COMPILER="-march=native -mtune=native -ipo16 -fp-model fast=2 -O3 -qopt-report=5 -mcmodel=large"
On the i7 everything works well. On KNL, however, the code works only without -march=native; if I add this option, the program throws a floating point exception immediately. If I compile with -march=native as the only flag, the result is the same. gdb points at the line pp+=alpha/rd in this piece of code:
...
// the code above runs in a single thread
double K1=0.0, P=0.0;
#pragma omp parallel for reduction(+:P_x,P_y,P_z, K1,P)
for(int i=0; i<N; ++i)
{
    P_x+=p[i].vx*p[i].m;
    P_y+=p[i].vy*p[i].m;
    P_z+=p[i].vz*p[i].m;
    K1+=p[i].vx*p[i].vx+p[i].vy*p[i].vy+p[i].vz*p[i].vz;
    float pp=0.0;
    #pragma simd reduction(+:pp)
    for(int j=0; j<N; ++j) if(i!=j)
    {
        float rd=sqrt((p[i].x-p[j].x)*(p[i].x-p[j].x)+(p[i].y-p[j].y)*(p[i].y-p[j].y)+(p[i].z-p[j].z)*(p[i].z-p[j].z));
        pp+=alpha/rd;
    }
    P+=pp;
}
...
Here p is declared as Particle p[N]; it is an array of particles, where Particle is a structure of floats and N is the maximum number of particles.
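One way to check whether the i == j element (where rd would be 0) plays a role is to restructure the inner loop so that element is never part of the loop body. The following is only a sketch under stated assumptions: the Particle field layout, the use of std::vector instead of the global array, and the names partial_sum and pair_potential_sum are all hypothetical. Whether the vectorized original actually evaluates alpha/rd for the i == j lane is a hypothesis, not something confirmed here.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for the Particle structure of floats described above.
struct Particle {
    float x, y, z;     // position
    float vx, vy, vz;  // velocity
    float m;           // mass
};

// Adds alpha / rd for every j in [begin, end), measured from particle i.
// The caller must make sure i is not inside [begin, end).
float partial_sum(const std::vector<Particle>& p, int i,
                  int begin, int end, float alpha) {
    float pp = 0.0f;
    for (int j = begin; j < end; ++j) {
        const float dx = p[i].x - p[j].x;
        const float dy = p[i].y - p[j].y;
        const float dz = p[i].z - p[j].z;
        const float rd = std::sqrt(dx * dx + dy * dy + dz * dz);
        pp += alpha / rd;
    }
    return pp;
}

// Same sum as the loop in question, but the j range is split into
// [0, i) and (i, n), so no branch on i != j is needed inside the loop.
double pair_potential_sum(const std::vector<Particle>& p, float alpha) {
    const int n = static_cast<int>(p.size());
    double P = 0.0;
    #pragma omp parallel for reduction(+:P)
    for (int i = 0; i < n; ++i) {
        P += partial_sum(p, i, 0, i, alpha) + partial_sum(p, i, i + 1, n, alpha);
    }
    return P;
}
```

For two particles at distance 2 and alpha = 1, each particle contributes 1/2, so the sum is 1. Two distinct particles at the same position would still give rd == 0, so this only rules out the self-interaction case.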
If I remove the flag -march=native, or replace it with -march=knl or with -march=core-avx2, everything works OK. This flag is doing something bad to the program, but I don't know what.
I found on the Internet (https://software.intel.com/en-us/articles/porting-applications-from-knights-corner-to-knights-landing, https://math-linux.com/linux/tip-of-the-day/article/intel-compilation-for-mic-architecture-knl-knights-landing) that one should use the flag -xMIC-AVX512. I tried this flag and also -axMIC-AVX512, but they give the same error.
So, what I want to ask is:
1) Why do -march=native and -xMIC-AVX512 not work while -march=knl works? Is -xMIC-AVX512 included in the -march=native flag for KNL?
2) May I replace the flag -march=native with -march=knl when I run the code on KNL (on the i7 everything works)? Are they equivalent?
3) Is the set of flags above optimal for best performance with the Intel compiler?
As Peter Cordes suggested, here is the assembler output from GDB when the program throws the floating point exception: 1) the output of (gdb) disas:
Program received signal SIGFPE, Arithmetic exception.
0x000000000040e3cc in randomizeBodies() ()
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-16.el7.x86_64 libstdc++-4.8.5-16.el7.x86_64
(gdb) disas
Dump of assembler code for function _Z15randomizeBodiesv:
0x000000000040da70 <+0>: push %rbp
0x000000000040da71 <+1>: mov %rsp,%rbp
0x000000000040da74 <+4>: and $0xffffffffffffffc0,%rsp
0x000000000040da78 <+8>: sub $0x100,%rsp
0x000000000040da7f <+15>: vpxor %xmm0,%xmm0,%xmm0
0x000000000040da83 <+19>: vmovups %xmm0,(%rsp)
0x000000000040da88 <+24>: vxorpd %xmm5,%xmm5,%xmm5
0x000000000040da8c <+28>: vmovq %xmm0,0x10(%rsp)
0x000000000040da92 <+34>: mov $0x77359400,%ecx
0x000000000040da97 <+39>: xor %eax,%eax
0x000000000040da99 <+41>: movabs $0x5deece66d,%rdx
0x000000000040daa3 <+51>: mov %ecx,%ecx
0x000000000040daa5 <+53>: imul %rdx,%rcx
0x000000000040daa9 <+57>: add $0xb,%rcx
0x000000000040daad <+61>: mov %ecx,0x9a3b00(,%rax,8)
0x000000000040dab4 <+68>: mov %ecx,%esi
0x000000000040dab6 <+70>: imul %rdx,%rsi
0x000000000040daba <+74>: add $0xb,%rsi
0x000000000040dabe <+78>: mov %esi,0x9e3d00(,%rax,8)
0x000000000040dac5 <+85>: mov %esi,%edi
0x000000000040dac7 <+87>: imul %rdx,%rdi
0x000000000040dacb <+91>: add $0xb,%rdi
0x000000000040dacf <+95>: mov %edi,0xa23f00(,%rax,8)
0x000000000040dad6 <+102>: mov %edi,%r8d
0x000000000040dad9 <+105>: imul %rdx,%r8
0x000000000040dadd <+109>: add $0xb,%r8
0x000000000040dae1 <+113>: mov %r8d,0xa64100(,%rax,8)
0x000000000040dae9 <+121>: mov %r8d,%r9d
0x000000000040daec <+124>: imul %rdx,%r9
0x000000000040daf0 <+128>: add $0xb,%r9
0x000000000040daf4 <+132>: mov %r9d,0xaa4300(,%rax,8)
0x000000000040dafc <+140>: mov %r9d,%r10d
0x000000000040daff <+143>: imul %rdx,%r10
0x000000000040db03 <+147>: add $0xb,%r10
0x000000000040db07 <+151>: mov %r10d,0x9a3b04(,%rax,8)
0x000000000040db0f <+159>: mov %r10d,%r11d
0x000000000040db12 <+162>: imul %rdx,%r11
0x000000000040db16 <+166>: add $0xb,%r11
0x000000000040db1a <+170>: mov %r11d,0x9e3d04(,%rax,8)
0x000000000040db22 <+178>: mov %r11d,%ecx
0x000000000040db25 <+181>: imul %rdx,%rcx
0x000000000040db29 <+185>: add $0xb,%rcx
0x000000000040db2d <+189>: mov %ecx,0xa23f04(,%rax,8)
2) the output of p $mxcsr:
(gdb) p $mxcsr
$1 = [ ZE PE DAZ DM PM FZ ]
3) the output of p $ymm0.v8_float:
$2 = {3, 3, 3, 3, 3, 3, 3, 3}
4) the output of p $zmm0.v16_float:
(gdb) p $zmm0.v16_float
$3 = {3 <repeats 16 times>}
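For readers unfamiliar with the flag names in the $mxcsr output, the sketch below decodes an MXCSR value using the standard x86 bit layout. It is not part of the original program: decode_mxcsr and zero_divide_unmasked_and_raised are hypothetical helper names, and 0x9164 is the value implied by the flag list above assuming the rounding-control bits are zero, not a value printed by gdb in this question.

```cpp
#include <cstdint>
#include <string>

// Decodes an MXCSR value into space-separated flag names, lowest bit first.
// Bits 13-14 (rounding control) are deliberately skipped in this sketch.
std::string decode_mxcsr(std::uint32_t mxcsr) {
    static const struct { int bit; const char* name; } flags[] = {
        {0, "IE"}, {1, "DE"}, {2, "ZE"}, {3, "OE"}, {4, "UE"}, {5, "PE"},  // status flags
        {6, "DAZ"},                                                        // denormals-are-zero
        {7, "IM"}, {8, "DM"}, {9, "ZM"}, {10, "OM"}, {11, "UM"}, {12, "PM"},  // exception masks
        {15, "FZ"},                                                        // flush-to-zero
    };
    std::string out;
    for (const auto& f : flags) {
        if (mxcsr & (1u << f.bit)) {
            if (!out.empty()) out += ' ';
            out += f.name;
        }
    }
    return out;
}

// True when the divide-by-zero status flag (ZE, bit 2) is set while its
// mask bit (ZM, bit 9) is clear, i.e. a zero-divide was raised and would trap.
bool zero_divide_unmasked_and_raised(std::uint32_t mxcsr) {
    return (mxcsr & (1u << 2)) != 0 && (mxcsr & (1u << 9)) == 0;
}
```

Read this way, the output shows ZE set and no ZM, IM, OM or UM mask bits, which is consistent with a divide-by-zero being raised while the invalid, divide-by-zero, overflow and underflow exceptions are unmasked; only the denormal (DM) and precision (PM) exceptions remain masked.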
I should also mention that to detect floating point exceptions I used the standard approach:
void handler(int sig)
{
    printf("Floating Point Exception\n");
    exit(0);
}
...
int main(int argc, char **argv)
{
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW);
    signal(SIGFPE, handler);
    ...
}
I should stress that I was already using feenableexcept when I got this error. I have used it since I began debugging the program, because we had errors (floating point exceptions) in the code and had to correct them.
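Because the handler above prints the same message for all four enabled exceptions, it can help to find out which one an operation raises without trapping at all. A minimal sketch, assuming the standard <cfenv> status flags and a build without fast-math style options (which may ignore the floating point environment); flags_raised_by_division is a hypothetical helper, not code from the program in question:

```cpp
#include <cfenv>

// Performs alpha / rd at run time and returns the IEEE exception flags it raised.
// Exceptions are expected to be masked here, so nothing traps; only flags are set.
int flags_raised_by_division(float alpha, float rd) {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float a = alpha;  // volatile keeps the compiler from folding the division
    volatile float r = rd;
    volatile float q = a / r;  // same operation as pp += alpha / rd
    (void)q;
    return std::fetestexcept(FE_ALL_EXCEPT);
}
```

With rd equal to 0 and a nonzero alpha, the returned flags include FE_DIVBYZERO, which is one of the exceptions that feenableexcept in main turns into SIGFPE.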
-march=native is the same as compiling with -march=skylake or whatever it is. Native means to make code that assumes it's running on the same machine that compiled it, so you shouldn't expect it to work on other machines. – Peter Cordes
Is rd == 0.0 at that point or something? Do you have FP exceptions unmasked on your KNL system? Different compiler options can produce different FP behaviour (Intel's compiler enabled the equivalent of -ffast-math, so it's probably using AVX512ER (KNL-only) VRSQRT28PS to get a highish-precision fast approximation recip sqrt, much better than the vrsqrt14ps from plain AVX512, or the 12-bit vrsqrtps from plain SSE/AVX1). – Peter Cordes