0
votes

I'm trying to understand the difference between Vector Processor and SIMD architectures such as ARM NEON. I know that the two differ in how configurable the vector register length is. However, I'm not sure how their microarchitectures can differ. Is it the case that for SIMD machines we need to have as many processing units as the number of elements each instruction operates on? Or, just like vector processors, can we have fewer processing units than the number of data elements in a vector register and just need to use a sequencer to complete an instruction in multiple cycles?

Thanks


2 Answers

2
votes

You can implement short-vector SIMD (like NEON or x86 SSE) with narrower hardware that decodes each instruction into, for example, 2 internal operations.

Intel did this with 128-bit SSE vectors on Pentium 3 through Pentium M; Core 2 was the first P6-family microarchitecture to have full-width SIMD execution units.

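But the decoding is not data-dependent, so you don't need a full microcode sequencer: the number of internal operations is fixed by the instruction and the width of the execution unit.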
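To make the splitting concrete, here is a minimal sketch in Python (a toy model, not how any real decoder is built). The names `add_lanes`, `simd_add` and `unit_lanes` are made up for illustration. It models a vector of 32-bit lanes and an execution unit that can only handle `unit_lanes` lanes per internal operation.

```python
MASK32 = 0xFFFFFFFF  # each lane is modelled as a 32-bit unsigned integer


def add_lanes(a, b):
    """Lane-wise 32-bit add with wraparound: one pass through the unit."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def simd_add(a, b, unit_lanes):
    """Add two equal-length vectors on a unit that is `unit_lanes` wide.

    Returns (result, uop_count). The instruction is cracked into
    len(a) / unit_lanes internal operations, one per slice. The count
    depends only on the vector and unit widths, never on the data.
    """
    if len(a) != len(b) or len(a) % unit_lanes != 0:
        raise ValueError("vector length must be a multiple of unit_lanes")
    result = []
    uops = 0
    for start in range(0, len(a), unit_lanes):
        end = start + unit_lanes
        result.extend(add_lanes(a[start:end], b[start:end]))
        uops += 1
    return result, uops


if __name__ == "__main__":
    a = [1, 2, 3, 0xFFFFFFFF]
    b = [10, 20, 30, 1]
    print(simd_add(a, b, unit_lanes=4))  # full-width unit: 1 internal op
    print(simd_add(a, b, unit_lanes=2))  # half-width unit: 2 internal ops
```

Both calls produce the same result vector; only the number of internal operations differs, which is why software can't tell the two implementations apart except by timing.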

1
votes

the difference between Vector Processor and SIMD

I dunno your definition of vector processor, but Wikipedia says SIMD is one type of them.

Is it the case that for SIMD machines we need to have as many processing units as the number of elements each instruction operates on?

No. Some CPUs split a SIMD register into parts and process them independently. Intel Pentium III split 128-bit SSE operations into 64-bit pieces, and AMD Zen does the same with 256-bit AVX instructions, splitting them into 128-bit pieces.

need to use a sequencer to complete an instruction in multiple cycles?

Just because they’re split doesn’t mean they run sequentially. All modern CPUs, including ARM, have multiple execution units (EUs) per core. Micro-ops can run in parallel on different EUs, but these EUs aren’t identical. Since I mentioned AMD Zen, here’s a link. The core can start executing up to 10 different micro-ops per cycle: 4 integer (all can do add or bitwise, 2 of them can multiply/divide, 2 of them can branch), 2 integer load/stores, and 4 128-bit floating-point operations (two can add, the other two can multiply, two can AES encrypt). It can finish up to 16 instructions/cycle, 8 integer and 8 floating-point. Different micro-ops take different numbers of cycles to complete.
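As a rough illustration of why splitting doesn't have to cost extra cycles, here is a small Python sketch. It is a deliberately simplified model: it assumes the pieces are independent, that all units of the relevant kind are identical, and that each unit can start one micro-op per cycle. It ignores latency, dependencies and scheduler limits. The function names are hypothetical.

```python
def cycles_to_issue(num_uops, num_units):
    """Cycles needed to start `num_uops` independent micro-ops when each
    of `num_units` identical units can start one micro-op per cycle."""
    if num_uops < 0 or num_units < 1:
        raise ValueError("need num_uops >= 0 and num_units >= 1")
    return -(-num_uops // num_units)  # ceiling division


def split_instruction_cycles(vector_bits, unit_bits, num_units):
    """Cycles to start one SIMD instruction that is split into
    vector_bits / unit_bits pieces across `num_units` execution units."""
    if vector_bits % unit_bits != 0:
        raise ValueError("vector width must be a multiple of unit width")
    pieces = vector_bits // unit_bits
    return cycles_to_issue(pieces, num_units)


if __name__ == "__main__":
    # 256-bit op, 128-bit units, two suitable units: both halves start together.
    print(split_instruction_cycles(256, 128, 2))
    # Same op with only one suitable unit: the halves go back to back.
    print(split_instruction_cycles(256, 128, 1))
```

Under these assumptions, a 256-bit operation split into two 128-bit pieces starts in a single cycle when two suitable units are free, and only needs two cycles when there is just one.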