Via C3 processors do allow something like this, after enabling it via an MSR and executing an undocumented 0F 3F
instruction to activate the https://en.wikipedia.org/wiki/Alternate_Instruction_Set which doesn't enforce the usual privileged (ring 0) vs. unprivileged (ring 3) protections. (Unfortunately, Via Samuel II shipped with this MSR defaulting to enabled. And since Via didn't document it, OSes didn't know they should turn that capability off. Other Via CPUs default to disabled.)
See Christopher Domas's talk from DEF CON 26:
GOD MODE UNLOCKED Hardware Backdoors in redacted x86.
He also developed an assembler for that AIS (Alternate Instruction Set):
https://github.com/xoreaxeaxeax/rosenbridge, along with tools for activating it (or closing the vulnerability!)
After running 0F 3F
(which jumps to EAX), AIS instructions are encoded with a 3-byte prefix in front of a 4-byte RISC instruction. (The encoding isn't distinct from existing x86 instructions; the prefix takes over LEA and BOUND, but you can otherwise mix Via RISC and x86 instructions.)
The AIS (Alternate Instruction Set) uses RISC-like fixed-width 32-bit instructions; thus we already know that not all possible uops can be encoded as RISC instructions. The machine decodes x86 instructions like 6-byte add eax, 0x12345678
(with a 32-bit immediate) to a single uop. But a 32-bit instruction word doesn't have room for a 32-bit constant and an opcode and destination register. So it's an alternate RISC-like ISA that's limited to a subset of things the back-end can execute and that their RISC decoder can decode from a 32-bit instruction.
(related: Could a processor be made that supports multiple ISAs? (ex: ARM + x86) discusses some challenges of doing this as more than a gimmick, like having a full ARM mode with actual expectations of performance, and all the addressing modes and instructions ARM requires.)
uops wouldn't be as nice as an actual ARM or PowerPC ISA
@jalf's answer covers most of the reasons, but there's one interesting detail it doesn't mention: The internal RISC-like core isn't designed to run an instruction set quite like ARM/PPC/MIPS. The x86-tax isn't only paid in the power-hungry decoders, but to some degree throughout the core. i.e. it's not just the x86 instruction encoding; it's every instruction with weird semantics.
(Unless those clunky semantics are handled with multiple uops, in which case you could just use the one useful uop. E.g. for shl reg, cl
, raw uops would let you drop the inconvenient requirement to leave FLAGS unmodified when the shift count is 0
, which is why shl reg,cl
is 3 uops on Intel SnB-family; so raw uops would be great here. Without raw uops, you need BMI2 shlx
for single-uop shifts (which don't touch FLAGS at all).)
Let's pretend that Intel did create an operating mode where the instruction stream was something other than x86, with instructions that mapped more directly to uops. Let's also pretend that each CPU model has its own ISA for this mode, so they're still free to change the internals when they like, and expose them with a minimal amount of transistors for instruction-decode of this alternate format.
Presumably you'd still only have the same number of registers, mapped to the x86 architectural state, so x86 OSes can save/restore it on context switches without using the CPU-specific instruction set. But if we throw out that practical limitation, yes we could have a few more registers because we can use the hidden temp registers normally reserved for microcode (footnote 1).
If we just have alternate decoders with no changes to later pipeline stages (execution units), this ISA would still have many x86 eccentricities. It would not be a very nice RISC architecture. No single instruction would be very complex, but some of the other craziness of x86 would still be there.
For example: int->FP conversion like cvtsi2sd xmm0, eax
merges into the low element of an XMM register, thus has a (false) dependency on the old register value. Even the AVX version just takes a separate arg for the register to merge into, instead of zero-extending into an XMM/YMM register. This is certainly not what you usually want, so GCC usually does an extra pxor xmm0, xmm0
to break the dependency on whatever was previously using XMM0. Similarly sqrtss xmm1, xmm2
merges into xmm1.
Again, nobody wants this (or in the rare case they do, could emulate it), but SSE1 was designed back in the Pentium III days when Intel's CPUs handled an XMM register as two 64-bit halves. Zero-extending into the full XMM register would have cost an extra uop on every scalar-float instruction in that core, while packed-float SIMD instructions were already 2 uops each anyway. But this was very short-sighted; it wasn't long before P4 had full-width XMM registers. (Although the P6-derived Pentium M and Core (not Core 2) still had half-width XMM hardware.) Still, Intel's short-term gain for P-III is ongoing long-term pain for compilers, and for future CPUs that have to run code with either extra instructions or possible false dependencies.
If you're going to make a whole new decoder for a RISC ISA, you can have it pick and choose parts of x86 instructions to be exposed as RISC instructions. This mitigates the x86-specialization of the core somewhat.
The instruction encoding would probably not be fixed-size, since single uops can hold a lot of data. Much more data than makes sense if all insns are the same size. A single micro-fused uop can add a 32bit immediate and a memory operand that uses an addressing mode with 2 registers and a 32bit displacement. (In SnB and later, only single-register addressing modes can micro-fuse with ALU ops).
uops are very large, and not very similar to fixed-width ARM instructions. A fixed-width 32bit instruction set can only load 16bit immediates at a time, so loading a 32bit address requires a load-immediate-low-half / load-high-immediate pair. x86 doesn't have to do that, which helps it not be terrible with only 15 GP registers limiting the ability to keep constants around in registers. (15 is a big help over 7 registers, but doubling again to 31 helps a lot less, I think some simulation found. RSP is usually not general purpose, so it's more like 15 GP registers and a stack.)
TL;DR summary:
Anyway, this answer boils down to "the x86 instruction set is probably the best way to program a CPU that has to be able to run x86 instructions quickly", but hopefully sheds some light on the reasons.
Internal uop formats in the front-end vs. back-end
See also Micro fusion and addressing modes for one case of differences in what the front-end vs. back-end uop formats can represent on Intel CPUs.
Footnote 1: There are some "hidden" registers for use as temporaries by microcode. These registers are renamed just like the x86 architectural registers, so multi-uop instructions can execute out-of-order.
e.g. xchg eax, ecx
on Intel CPUs decodes as 3 uops (why?), and our best guess is that these are MOV-like uops that do tmp = ecx; ecx = eax; eax = tmp
. In that order, because I measure the latency of the dst->src direction at ~1 cycle, vs. 2 for the other way. And these move uops aren't like regular mov
instructions; they don't seem to be candidates for zero-latency mov-elimination.
See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for a mention of trying to experimentally measure PRF size, and having to account for physical registers used to hold architectural state, including hidden registers.
In the front-end after the decoders, but before the issue/rename stage that renames registers onto the physical register file, the internal uop format uses register numbers similar to x86 reg numbers, but with room to address these hidden registers.
The uop format is somewhat different inside the out-of-order core (ROB and RS), aka back-end (after the issue/rename stage). The int/FP physical register files each have 168 entries in Haswell, so each register field in a uop needs to be wide enough to address that many.
Since the renamer is there in the HW, we'd probably be better off using it, instead of feeding statically scheduled instructions directly to the back-end. So we'd get to work with a set of registers as large as the x86 architectural registers + microcode temporaries, not more than that.
The back-end is designed to work with a front-end renamer that avoids WAW / WAR hazards, so we couldn't use it like an in-order CPU even if we wanted to. It doesn't have interlocks to detect those dependencies; that's handled by issue/rename.
It might be neat if we could feed uops into the back-end without the bottleneck of the issue/rename stage (the narrowest point in modern Intel pipelines, e.g. 4-wide on Skylake vs. 4 ALU + 2 load + 1 store ports in the back-end). But if you did that, I don't think you could statically schedule code to avoid reusing a register and stepping on a result that's still needed when a cache miss stalls a load for a long time.
So we pretty much need to feed uops to the issue/rename stage, probably only bypassing decode, not the uop cache or IDQ. Then we get normal OoO exec with sane hazard detection. The register allocation table is only designed to rename 16 + a few integer registers onto the 168-entry integer PRF. We couldn't expect the HW to rename a larger set of logical registers onto the same number of physical registers; that would take a larger RAT.