I am investigating how many FLOPs can be performed per CPU cycle using the GotoBLAS library. I ran a matrix multiplication on 32-bit floating-point numbers and, by hand calculation, got roughly 8 FLOPs per CPU cycle. My guess is that this is because my processor (Intel Xeon E5430) has two FPUs, each of which handles one SSE instruction on 128-bit XMM registers per cycle. With 32-bit floats, each 128-bit register holds 4 values, so that gives 2 * 4 = 8 FLOPs per CPU cycle.
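To make the arithmetic behind that guess explicit, here is a minimal sketch in Python. The function names are made up for illustration, the count of two units is my guess from above rather than a confirmed figure, and the hand calculation assumes the usual 2 * n^3 operation count for an n x n matrix multiplication.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one SIMD register."""
    return register_bits // element_bits


def peak_flops_per_cycle(units: int, register_bits: int, element_bits: int) -> int:
    """Peak FLOPs per cycle if each unit retires one full-width SIMD op per cycle.

    `units` is an assumption (the guessed number of FPUs), not a documented value.
    """
    return units * simd_lanes(register_bits, element_bits)


def measured_flops_per_cycle(n: int, seconds: float, clock_hz: float) -> float:
    """Hand calculation for an n x n matrix multiply.

    Assumes 2 * n**3 floating-point operations (n**3 multiplies, n**3 adds)
    and that the core ran at a constant `clock_hz` for the whole run.
    """
    flops = 2 * n**3
    cycles = seconds * clock_hz
    return flops / cycles


if __name__ == "__main__":
    # Guess from the question: 2 units, 128-bit XMM registers, 32-bit floats.
    print(peak_flops_per_cycle(units=2, register_bits=128, element_bits=32))
    # The same reasoning applied to 64-bit doubles.
    print(peak_flops_per_cycle(units=2, register_bits=128, element_bits=64))
```

Dividing the result of `measured_flops_per_cycle` (with my own matrix size, wall-clock time, and clock frequency) by the peak value gives the fraction of the theoretical maximum that the library reaches, provided the guess about the number of units is right.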
Is my guess correct? Is there an official manual I can refer to for the number of FPUs in an Intel processor?
Thanks!