2
votes

I'm trying to use the Intel FMA intrinsics like _mm_fmadd_ps (__m128 a, __m128 b, __m128 c) in order to get better performance in my code.

So, first of all, I wrote a little test program to see what they can do and how I can use them.

#include <stdio.h>
#include <stdlib.h>
#include "xmmintrin.h"

int main()
{
   __m128 v1,v2,v3,vr;
   v1 = _mm_set_ps (5.0, 5.0, 5.0, 5.0);
   v2 = _mm_set_ps (2.0, 2.0, 2.0, 2.0);
   v3 = _mm_set_ps (3.0, 3.0, 3.0, 3.0);

   vr = _mm_fmadd_ps (v1, v2, v3);
}

and I got this error:

error: incompatible types when assigning to type ‘__m128’ from type ‘int’
   vr = _mm_fmadd_ps (v1, v2, v3);

I thought the processor might not support these instructions, so I looked up my processor model (Intel® Core™ i7-4700MQ Processor) and found that it is listed as supporting only SSE4.1/4.2 and AVX 2.0, which seemed a little weird to me. So I looked in /proc/cpuinfo, and in the flags section I found the **fma** flag. This is the confusing part about the hardware.

As for the software, I've used these makefile options after some digging on the internet, and I hope they're not the issue.

CC=gcc
CFLAGS=-g -c -Wall -O2 -mavx2 -mfma 

I'm using Eclipse on Ubuntu 12.04 LTS with GCC version 4.9.4. Thank you.

That is a compiler error. The code hasn't even started running yet, so it cannot possibly be lack of support from your chip. – Cody Gray
@PaulR: it worked. Thank you. – A.nechi
Note that this code does nothing useful, so when you compile it with optimizations enabled (-O2), the compiler elides all this code and simply emits code to return 0 from main (demo). So it'll run real fast. :-) – Cody Gray
@CodyGray: true - making vr volatile fixes this though, if you just want to see the generated code. – Paul R

2 Answers

3
votes

One of the quirks of C is that, historically, the compiler assumes that a function it has not seen declared returns int when you call it (C99 removed this rule, but older GCC versions still accept it with a warning). Since you didn't include the header that actually declares _mm_fmadd_ps, you get the strange error about assigning int to __m128. The FMA intrinsics are declared through immintrin.h, not xmmintrin.h.
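As a minimal sketch of the fix, assuming GCC or Clang: include immintrin.h so that _mm_fmadd_ps has a real declaration. The helper name fmadd4 is made up for illustration, and the target attribute is a compiler extension that is only needed if the file is not already built with -mfma.

```c
#include <assert.h>
#include <immintrin.h>  /* declares _mm_fmadd_ps; xmmintrin.h alone does not */

/* fmadd4 is a hypothetical helper name, not something from the question.
 * It computes out[i] = a[i] * b[i] + c[i] for four floats.
 * The target attribute (a GCC/Clang extension) enables FMA code generation
 * for this one function; it is redundant when the file is built with -mfma. */
__attribute__((target("fma")))
static void fmadd4(const float *a, const float *b, const float *c, float *out)
{
    assert(a && b && c && out);

    __m128 va = _mm_loadu_ps(a);  /* unaligned loads, so any float[4] works */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);

    /* One fused multiply-add across all four lanes. */
    _mm_storeu_ps(out, _mm_fmadd_ps(va, vb, vc));
}
```

Calling it with the values from the question (5.0, 2.0 and 3.0 in every lane) stores 13.0 in each element of out. Writing the result to memory also keeps the optimizer from discarding the computation, which is what happens to the original test program at -O2.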

The original organization of the intrinsics headers was to have one header per instruction-set generation, so you had:

mmintrin.h     The original MMX instruction set (deprecated for x64 native)
mm3dnow.h      The AMD 3D Now! instruction set (deprecated for x64 native)
xmmintrin.h    SSE (i.e. single-precision 4-wide SIMD)
emmintrin.h    SSE2 (i.e. double-precision and integer SIMD)

After that, they started using the code names of the processor architecture where the new instructions were introduced.

pmmintrin.h    SSE3 (the p stands for Prescott)
tmmintrin.h    Supplemental SSE3 (the t stands for Tejas)
smmintrin.h    SSE4.1 (not sure what the s is here for.
               They were added for Penryn but p
               was already used for Prescott)
nmmintrin.h    SSE4.2 (the n stands for Nehalem)
wmmintrin.h    AES (the w stands for Westmere)

These days the new instruction sets tend to end up in either ammintrin.h for AMD-originated stuff (ABM, BMI, LWP, TBM, XOP, FMA4, SSE4a, SSE5) or immintrin.h for Intel-originated stuff (AVX, FMA3, F16C, AVX2, etc.). AVX-512 is in zmmintrin.h in some toolchains, but you normally reach it by including immintrin.h rather than including that header directly.

The older system wasn't particularly intuitive, but neither is the new one. A number of AMD instruction subsets are defined in immintrin.h because they are the same instruction. Looking it up in the documentation or the header is really the only way to know which intrinsic is where.

For Intel this website is a good reference. Otherwise you need to see the developer guides for AMD and/or Intel.

You might find this blog series of mine useful.

1
votes

The -mfma flag might seem like a bit of a bother, but it's there for good reason. The results of

_mm_add_ps(_mm_mul_ps(a, b), c)
_mm_fmadd_ps(a, b, c)

can actually differ, because the fused version rounds once while the separate multiply and add round twice. If you are writing code that must compute exactly the same results on every machine it runs on (determinism), then you will probably need to disable FMA! That's basically why you need to enable it in the build with -mfma.
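Here is a small sketch of that difference, assuming GCC or Clang on x86. The helper name madd_both_ways and the input values are mine, chosen so that the exact product needs one more bit than a float can hold. The volatile temporary is there only to force the product to be rounded and to stop the compiler from fusing the two operations itself.

```c
#include <assert.h>
#include <immintrin.h>

/* madd_both_ways is a hypothetical helper name used for illustration.
 * It computes a*b + c with two roundings (unfused) and with one rounding
 * (fused), so the two results can be compared. */
__attribute__((target("fma")))
static void madd_both_ways(float a, float b, float c,
                           float *unfused, float *fused)
{
    assert(unfused && fused);

    /* Two roundings: the volatile store rounds the product to float and
     * prevents the compiler from contracting the expression into an FMA. */
    volatile float product = a * b;
    *unfused = product + c;

    /* One rounding: the exact product feeds straight into the addition. */
    __m128 r = _mm_fmadd_ss(_mm_set_ss(a), _mm_set_ss(b), _mm_set_ss(c));
    *fused = _mm_cvtss_f32(r);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounded to float it becomes 1 + 2^-11, so the unfused result is 0, while the fused result keeps the low bit and comes out as 2^-24.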

Still, at least it's not as bad as the six compile flags you'll need for AVX-512-enabled Skylake-X CPUs :(