I'm trying to understand how to maximize the number of floating-point operations per cycle I can get out of my CPU. I'm writing a simple matrix multiplication program, and I have a Skylake processor. I was looking at the Wikipedia page that lists FLOPS figures for this architecture, and I'm having difficulty understanding it.
From my understanding, FMA instructions take three floating-point inputs, right? And they let me mix adds and multiplies across those inputs. But what happens when I only add two floats? Does the hardware simply multiply one of them by one? Can I add three floats in one cycle, or will that be split into separate instructions? I saw that Skylake is listed at 32 FLOPs/cycle for single-precision inputs, but what does "two 8-wide FMA instructions" mean?
Thank you in advance for the explanations.