I've been reading up on the recently available AVX-512 instructions, and I feel like there's a basic concept I'm not understanding: what is the benefit of SIMD on a superscalar CPU that already performs out-of-order execution?
Consider the following pseudo-assembly. With SIMD:
load 16 floats to register simd-a
load 16 floats to register simd-b
multiply register simd-a by register simd-b as 16 floats into register simd-c
store register simd-c to memory
And this without SIMD:
load a float to register a
load a float to register b
multiply register a by register b as floats into register c
store register c to memory
load a float to register a (contiguous to prior load to a)
load a float to register b (contiguous to prior load to b)
multiply register a by register b as floats into register c
store register c to memory (contiguous to previous stored result)
[continued for 16 floats]
It's been a while since I've done low-level work like this, but it seems to me that an out-of-order CPU could reorder the non-SIMD example so that it effectively runs like this:
- 32 load instructions processed in parallel (likely as just two requests to cache/memory if memory is properly aligned)
- 16 multiply instructions executed in parallel once the loads complete
- 16 stores to memory which again would be only a single request to cache/memory if things are properly aligned
Essentially, it feels like the CPU could be intelligent enough to run both versions at the same speed. Obviously I'm missing something here, since ISAs keep gaining more and wider SIMD instructions, so where does the practical value of this type of instruction come from?