Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr
, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These numbers are related by being each others reciprocal, the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.