I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX:
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
I'd like to know how best to do this in code, and I also want to know how it's done internally in the CPU, that is, with the superscalar architecture. Let's say I want to compute a long sum such as the following in SSE:
//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1 = _mm_set1_ps(a[0]);
b1 = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));
a2 = _mm_set1_ps(a[1]);
b2 = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));
a3 = _mm_set1_ps(a[2]);
b3 = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...
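For reference, the unrolled pattern above can also be written as a loop. This is only a minimal sketch of the same computation: the function name scaled_vector_sum, the length parameter n, and the output array out are hypothetical names introduced for illustration, and it uses unaligned loads (_mm_loadu_ps) instead of _mm_load_ps so that it does not depend on b being 16-byte aligned.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper (not from the original snippet):
 * out[0..3] = a[0]*b[0..3] + a[1]*b[4..7] + ... + a[n-1]*b[4*(n-1) .. 4*(n-1)+3]
 */
static void scaled_vector_sum(const float *a, const float *b, size_t n, float out[4])
{
    __m128 sum = _mm_setzero_ps();

    for (size_t i = 0; i < n; ++i) {
        __m128 ai = _mm_set1_ps(a[i]);        /* broadcast the scalar a[i] to all 4 lanes */
        __m128 bi = _mm_loadu_ps(&b[4 * i]);  /* load 4 floats; no alignment assumed */

        /* Each iteration's add depends on the previous iteration's sum. */
        sum = _mm_add_ps(sum, _mm_mul_ps(ai, bi));
    }

    _mm_storeu_ps(out, sum);
}
```

Written this way, every add reads the sum produced by the previous add, which is the data dependency the question below is about.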
My question is: how does this get converted to a simultaneous multiply and add? Can the data be dependent? That is, can the CPU execute the multiply and the add in _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously, or do the registers used in the multiplication and the addition have to be independent?
Lastly, how does this apply to FMA (with Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?