Via C3 processors do allow something like this, after enabling it via an MSR and executing an undocumented 0F 3F
instruction to activate the https://en.wikipedia.org/wiki/Alternate_Instruction_Set which doesn't enforce the usual privileged (ring 0) vs. unprivileged (ring 3) protections. (Unfortunately, Via Samuel II shipped with this MSR defaulting to enabled. And since Via didn't document it, OSes didn't know they should turn that capability off. Other Via CPUs default to disabled.)
See Christopher Domas's talk from DEF CON 26:
GOD MODE UNLOCKED Hardware Backdoors in redacted x86.
He also developed an assembler for that AIS (Alternate Instruction Set):
https://github.com/xoreaxeaxeax/rosenbridge, along with tools for activating it (or closing the vulnerability!)
After running 0F 3F
(which jumps to EAX), AIS instructions are encoded with a 3-byte prefix in front of a 4-byte RISC instruction. (The encoding isn't distinct from existing x86 instructions; the prefix takes over LEA and BOUND, but you can otherwise mix Via RISC and x86 instructions.)
The AIS (Alternate Instruction Set) uses RISC-like fixed-width 32-bit instructions; thus we already know that not all possible uops can be encoded as RISC instructions. The machine decodes x86 instructions like 6-byte add eax, 0x12345678
(with a 32-bit immediate) to a single uop. But a 32-bit instruction word doesn't have room for a 32-bit constant and an opcode and destination register. So it's an alternate RISC-like ISA that's limited to a subset of things the back-end can execute and that their RISC decoder can decode from a 32-bit instruction.
(related: Could a processor be made that supports multiple ISAs? (ex: ARM + x86) discusses some challenges of doing this as more than a gimmick, like having a full ARM mode with actual expectations of performance, and all the addressing modes and instructions ARM requires.)
uops wouldn't be as nice as an actual ARM or PowerPC ISA
@jalf's answer covers most of the reasons, but there's one interesting detail it doesn't mention: The internal RISC-like core isn't designed to run an instruction set quite like ARM/PPC/MIPS. The x86-tax isn't only paid in the power-hungry decoders, but to some degree throughout the core. i.e. it's not just the x86 instruction encoding; it's every instruction with weird semantics.
(Unless those clunky semantics are handled with multiple uops, in which case you could just use the one useful uop. E.g. for shl reg, cl
, raw uops would let you drop the inconvenient requirement to leave FLAGS unmodified when the shift count is 0
, which is why shl reg,cl
is 3 uops on Intel SnB-family; so raw uops would be great here. Without raw uops, you need BMI2 shlx
for single-uop shifts (which don't touch FLAGS at all).)
Let's pretend that Intel did create an operating mode where the instruction stream was something other than x86, with instructions that mapped more directly to uops. Let's also pretend that each CPU model has its own ISA for this mode, so they're still free to change the internals when they like, and expose them with a minimal amount of transistors for instruction-decode of this alternate format.
Presumably you'd still only have the same number of registers, mapped to the x86 architectural state, so x86 OSes can save/restore it on context switches without using the CPU-specific instruction set. But if we throw out that practical limitation, yes we could have a few more registers because we can use the hidden temp registers normally reserved for microcode (footnote 1).
If we just have alternate decoders with no changes to later pipeline stages (execution units), this ISA would still have many x86 eccentricities. It would not be a very nice RISC architecture. No single instruction would be very complex, but some of the other craziness of x86 would still be there.
For example: int->FP conversion like cvtsi2sd xmm0, eax
merges into the low element of an XMM register, thus has a (false) dependency on the old register value. Even the AVX version just takes a separate arg for the register to merge into, instead of zero-extending into an XMM/YMM register. This is certainly not what you usually want, so GCC usually does an extra pxor xmm0, xmm0
to break the dependency on whatever was previously using XMM0. Similarly sqrtss xmm1, xmm2
merges into xmm1.
Again, nobody wants this (or in the rare case they do, could emulate it), but SSE1 was designed back in the Pentium III days when Intel's CPUs handled an XMM register as two 64-bit halves. Zero-extending into the full XMM register would have cost an extra uop on every scalar-float instruction in that core, while packed-float SIMD instructions were already 2 uops each anyway. But this was very short-sighted; it wasn't long before P4 had full-width XMM registers. (Although the P6-derived Pentium M and Core (not Core 2) still had half-width XMM hardware.) Still, Intel's short-term gain for P-III is ongoing long-term pain for compilers, and for future CPUs that have to run code with either extra instructions or possible false dependencies.
If you're going to make a whole new decoder for a RISC ISA, you can have it pick and choose parts of x86 instructions to be exposed as RISC instructions. This mitigates the x86-specialization of the core somewhat.
The instruction encoding would probably not be fixed-size, since single uops can hold a lot of data. Much more data than makes sense if all insns are the same size. A single micro-fused uop can add a 32bit immediate and a memory operand that uses an addressing mode with 2 registers and a 32bit displacement. (In SnB and later, only single-register addressing modes can micro-fuse with ALU ops).
uops are very large, and not very similar to fixed-width ARM instructions. A fixed-width 32bit instruction set can only load 16bit immediates at a time, so loading a 32bit address requires a load-immediate-low-half / load-high-immediate pair. x86 doesn't have to do that, which helps it not be terrible with only 15 GP registers limiting the ability to keep constants around in registers. (15 is a big help over 7 registers, but doubling again to 31 helps a lot less, I think some simulation found. RSP is usually not general purpose, so it's more like 15 GP registers and a stack.)
TL;DR summary:
Anyway, this answer boils down to "the x86 instruction set is probably the best way to program a CPU that has to be able to run x86 instructions quickly", but hopefully sheds some light on the reasons.
Internal uop formats in the front-end vs. back-end
See also Micro fusion and addressing modes for one case of differences in what the front-end vs. back-end uop formats can represent on Intel CPUs.
Footnote 1: There are some "hidden" registers for use as temporaries by microcode. These registers are renamed just like the x86 architectural registers, so multi-uop instructions can execute out-of-order.
e.g. xchg eax, ecx
on Intel CPUs decodes as 3 uops (why?), and our best guess is that these are MOV-like uops that do tmp = ecx; ecx = eax; eax = tmp
. In that order, because I measure the latency of the dst->src direction at ~1 cycle, vs. 2 for the other way. And these move uops aren't like regular mov
instructions; they don't seem to be candidates for zero-latency mov-elimination.
See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for a mention of trying to experimentally measure PRF size, and having to account for physical registers used to hold architectural state, including hidden registers.
In the front-end after the decoders, but before the issue/rename stage that renames registers onto the physical register file, the internal uop format uses register numbers similar to x86 reg numbers, but with room to address these hidden registers.
The uop format is somewhat different inside the out-of-order core (ROB and RS), aka back-end (after the issue/rename stage). The int/FP physical register files each have 168 entries in Haswell, so each register field in a uop needs to be wide enough to address that many.
Since the renamer is there in the HW, we'd probably be better off using it, instead of feeding statically scheduled instructions directly to the back-end. So we'd get to work with a set of registers as large as the x86 architectural registers + microcode temporaries, not more than that.
The back-end is designed to work with a front-end renamer that avoids WAW / WAR hazards, so we couldn't use it like an in-order CPU even if we wanted to. It doesn't have interlocks to detect those dependencies; that's handled by issue/rename.
It might be neat if we could feed uops into the back-end without the bottleneck of the issue/rename stage (the narrowest point in modern Intel pipelines, e.g. 4-wide on Skylake vs. 4 ALU + 2 load + 1 store ports in the back-end). But if you did that, I don't think you could statically schedule code to avoid reusing a register and stepping on a result that's still needed when a cache miss stalls a load for a long time.
So we pretty much need to feed uops to the issue/rename stage, probably only bypassing decode, not the uop cache or IDQ. Then we get normal OoO exec with sane hazard detection. The register allocation table is only designed to rename 16 + a few integer registers onto the 168-entry integer PRF. We couldn't expect the HW to rename a larger set of logical registers onto the same number of physical registers; that would take a larger RAT.