1 vote

I've been reading up on the recently available AVX-512 instructions, and I feel like there is a basic concept that I'm not understanding. What is the benefit of SIMD on a superscalar CPU that already performs out-of-order execution?

Consider the following pseudo-assembly code. With SIMD:

load 16 floats to register simd-a
load 16 floats to register simd-b
multiply register simd-a by register simd-b as 16 floats to register simd-c
store register simd-c to memory

And this without SIMD:

load a float to register a
load a float to register b
multiply register a and register b as floats to c
store register c to memory

load a float to register a (contiguous to prior load to a)
load a float to register b (contiguous to prior load to b)
multiply register a by register b as floats to register c
store register c to memory (contiguous to previous stored result)

[continued for 16 floats]
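For reference, here is roughly what the two versions could look like in C with AVX-512 intrinsics (a sketch; the function names mul_scalar and mul_avx512 are mine, purely for illustration):

    #include <immintrin.h>

    /* Non-SIMD version: one load/multiply/store per element. */
    void mul_scalar(const float *a, const float *b, float *c) {
        for (int i = 0; i < 16; i++)
            c[i] = a[i] * b[i];
    }

    /* SIMD version: all 16 elements handled by one instruction each. */
    void mul_avx512(const float *a, const float *b, float *c) {
        __m512 va = _mm512_loadu_ps(a);    /* load 16 floats         */
        __m512 vb = _mm512_loadu_ps(b);    /* load 16 floats         */
        __m512 vc = _mm512_mul_ps(va, vb); /* 16 multiplies at once  */
        _mm512_storeu_ps(c, vc);           /* store 16 results       */
    }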

It's been a while since I've done low-level work like this, but it seems to me that the CPU could convert the non-SIMD example to run as if it were data-parallel:

  1. 32 load instructions processed in parallel (likely as just two requests to cache/memory if memory is properly aligned)
  2. 16 multiply instructions executed in parallel once the loads complete
  3. 16 stores to memory which again would be only a single request to cache/memory if things are properly aligned

Essentially, it feels like the CPU could be intelligent enough to perform at the same speed in both cases. Obviously there's something I'm missing here, since ISAs keep gaining additional and wider SIMD instructions, so where does the practical value of these types of instructions come from?

SIMD and OOO provide two different kinds of parallelism: SIMD is data parallelism and OOO is instruction-level parallelism (ILP). The two concepts are orthogonal. Compared to a non-SIMD design offering the same throughput, SIMD simplifies the hardware, e.g. decode bandwidth and register file access. Note that the question pertains to computer architecture choices / tradeoffs, and is not really a programming question. I vote to close it for that reason. – njuffa
@njuffa, do ILP and DLP work with each other, or are they mutually exclusive? – Martin
@FackedDeveloper They are orthogonal. You can have neither, one or the other, or both. Modern x86 processors support both SIMD and OOO execution, as well as additional forms of parallelism, e.g. hyperthreading and multiple cores. – njuffa
@njuffa, do you think it would be reasonable to say that SIMD is a special case of data parallelism and that ILP is a special case of task parallelism? Also, SMT (e.g. hyper-threading) is task parallelism, but I think it's really a special case of ILP as well. – Z boson

1 Answer

6 votes

The difference is mainly the feasibility of realizing such a design in hardware. Superscalar architectures don't scale well, for several reasons. For example, it would be difficult to rename that many registers in one cycle, because the things you're renaming might be dependent on each other (if it really were translated SIMD code they wouldn't be, but the hardware can't know that). The physical register file would need a boatload of extra read and write ports, which is pretty annoying; wider registers, by contrast, are easy. The forwarding network would explode in size.

A lot of µops would have to be inserted into the active window every cycle, a lot of them would have to be woken up and dispatched, and a lot of them would have to retire. And since the machine is now being flooded with an order of magnitude more µops, you'd probably want to support a bigger active window, otherwise it has effectively become smaller (for equivalent code it becomes less effective).

The whole memory business is harder too, since now you'd have to support a lot of accesses in a cycle (that all have to go through separate translations, have ordering constraints applied to them, participate in forwarding, and so forth), instead of just wider accesses (which is relatively easy).
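To put rough, illustrative numbers on that (assuming aligned data and 64-byte cache lines):

    16 × float32 = 64 bytes = 1 aligned cache line
    one 512-bit load : 1 µop,  1 address generation,   1 TLB lookup,   1 forwarding check
    16 scalar loads  : 16 µops, 16 address generations, 16 TLB lookups, 16 forwarding checks

and the load ports, TLB and store queue would all have to sustain that scalar rate every cycle.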

Basically this hypothetical design takes a lot of things that are already hard to implement efficiently with a reasonable power and area budget, and then makes them even harder. The complexity of many of those things scales approximately quadratically with the number of µops that you want to put through them in a cycle, not linearly.
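As a concrete (illustrative) instance of that quadratic scaling: a bypass network that can forward any of n results produced per cycle to any of n consumers needs on the order of n² paths, so quadrupling the width costs roughly sixteen times the forwarding hardware:

    4-wide:  ~ 4 ×  4 =  16 forwarding paths
    16-wide: ~16 × 16 = 256 forwarding paths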

Adding wider SIMD, the way it has actually been done, is largely just copy-pasting the SIMD unit (hence the annoying per-128-bit-lane semantics of most AVX and AVX2 instructions) and giving some things a higher bit width. There is no bad scaling if you do it that way.
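To see what "copy-pasting the SIMD unit" means in practice, here is a small sketch in C (the function name lane_demo is mine): _mm256_unpacklo_ps interleaves within each 128-bit lane independently, rather than across the whole 256-bit register.

    #include <immintrin.h>

    void lane_demo(float *out) {
        __m256 a = _mm256_setr_ps(0, 1, 2,  3,  4,  5,  6,  7);
        __m256 b = _mm256_setr_ps(8, 9, 10, 11, 12, 13, 14, 15);
        __m256 r = _mm256_unpacklo_ps(a, b);
        /* r = {0, 8, 1, 9,  4, 12, 5, 13}: two independent 128-bit
           lanes, not the {0, 8, 1, 9, 2, 10, 3, 11} that a full-width
           unpack would give. */
        _mm256_storeu_ps(out, r);
    }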