I'm having a hard time understanding how the theoretical instructions per cycle (IPC) for a Fermi-architecture NVIDIA GPU can be 2, as stated in http://on-demand.gputechconf.com/gtc-express/2011/presentations/Inst_limited_kernels_Oct2011.pdf (page 9).
According to section 5.4.1 of the programming guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions), an SM can execute 32 fp32 (32-bit floating-point) instructions per clock cycle.
How do the two quantities relate?
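
To make the units concrete, here is a minimal sketch of the conversion I tried. It rests on my own assumption, which neither source states in these words, that the IPC on the slide counts warp-level instructions while the programming guide counts per-thread operations. The function name is just something I made up for illustration.

```python
# Assumption: a warp is 32 threads, and the slide's IPC counts warp-level
# instructions while the programming guide counts per-thread operations.
WARP_SIZE = 32


def thread_ops_to_warp_instructions(thread_ops_per_clock, warp_size=WARP_SIZE):
    """Convert per-thread operations per clock into warp instructions per clock."""
    return thread_ops_per_clock / warp_size


# Figure quoted from the programming guide for 32-bit floats.
fp32_thread_ops_per_sm_per_clock = 32

fp32_warp_ipc = thread_ops_to_warp_instructions(fp32_thread_ops_per_sm_per_clock)
print(fp32_warp_ipc)
```

If that assumption holds, the fp32 figure works out to 1 warp instruction per clock per SM, which still does not match the theoretical IPC of 2, and that gap is the part I can't account for.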