`xacquire`/`xrelease` are just F2/F3 REP prefixes and are safely ignored by all CPUs that don't support that feature, including non-Intel. That's why Intel chose that encoding for the prefixes. It's even better than a NOP, which has to decode as a separate instruction.
In general (across vendors), CPUs ignore REP prefixes they don't understand. So new extensions can use REP as part of their encoding if it's useful for them to decode as something else on old CPUs, instead of raising `#UD`.
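As a rough illustration of how that backwards compatibility looks from C, here is a minimal spinlock sketch using GCC's x86-specific `__ATOMIC_HLE_ACQUIRE` / `__ATOMIC_HLE_RELEASE` memory-model flags, which ask the compiler to emit the `xacquire` / `xrelease` prefixes on the atomic exchange and the releasing store. The `hle_spinlock` type and the function names are made up for this example, and the `#ifndef` fallbacks are an assumption added so the code still compiles with compilers or targets that don't define the flags (in which case it is just an ordinary spinlock).

```c
/* Fall back to plain atomics where the HLE flags are not defined
   (assumption for portability; GCC defines them when targeting x86). */
#ifndef __ATOMIC_HLE_ACQUIRE
#define __ATOMIC_HLE_ACQUIRE 0
#endif
#ifndef __ATOMIC_HLE_RELEASE
#define __ATOMIC_HLE_RELEASE 0
#endif

/* Hypothetical lock type for this sketch: 0 = free, 1 = held. */
typedef struct { int locked; } hle_spinlock;

static void hle_lock(hle_spinlock *l)
{
    /* With GCC on x86 this exchange is expected to carry an xacquire (F2)
       prefix. CPUs without HLE ignore the prefix and run a normal xchg. */
    while (__atomic_exchange_n(&l->locked, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)) {
        /* Lock was already held: spin read-only until it looks free. */
        while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
            ;
    }
}

static void hle_unlock(hle_spinlock *l)
{
    /* With GCC on x86 this store is expected to carry an xrelease (F3)
       prefix. CPUs without HLE ignore it and do a plain release store. */
    __atomic_store_n(&l->locked, 0,
                     __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```

The same binary runs whether or not the CPU implements HLE; only the performance differs, which is the point of reusing the REP prefix bytes.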
I don't think it's plausible for AMD to introduce an incompatible meaning for `rep` prefixes on `lock`ed instructions or `mov`-stores - that would break real-world binaries that already use these prefixes. For example, I'm pretty sure some builds of libpthread in mainstream GNU/Linux distros have used this to enable hardware lock elision, and don't use dynamic CPU dispatching to run different code based on CPUID for this.
Using REP as a mandatory prefix for a backwards-compatible new instruction has been done before, e.g. with `rep nop` = `pause` or `rep bsf` = `tzcnt`. (Useful for compilers because `tzcnt` is faster on some CPUs, and gives the same result if the input is known non-zero.) And `rep ret` as a workaround for AMD pre-Bulldozer branch predictors is widely used by GCC (see: What does `rep ret` mean?). That meaningless REP definitely works (silently ignored) in practice on AMD.
(The reverse is not true. You can't write software that counts on a meaningless REP prefix being ignored by future CPUs. Some later extension might give it a meaning, e.g. like with `rep bsr`, which runs as `lzcnt` on CPUs that support it and gives a different result. This is why Intel documents the effect of meaningless prefixes as "undefined".)
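To make the difference concrete, here is a small sketch with plain-C software models of the four instructions for 32-bit operands (the `*_model` names are invented for this example; they only model the result values, not flags or the zero-input behavior of `bsf` / `bsr`). It shows why `rep bsf` is safe to use for non-zero inputs while `rep bsr` is not interchangeable with `bsr`.

```c
#include <stdint.h>

/* Model of bsf for x != 0: index of the lowest set bit.
   Returns 0 as a placeholder for x == 0, where real hardware leaves the
   destination unmodified or undefined. */
static unsigned bsf_model(uint32_t x)
{
    for (unsigned i = 0; i < 32; i++)
        if ((x >> i) & 1u)
            return i;
    return 0;
}

/* Model of tzcnt: number of trailing zero bits, 32 for x == 0. */
static unsigned tzcnt_model(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Model of bsr for x != 0: index of the highest set bit. */
static unsigned bsr_model(uint32_t x)
{
    unsigned idx = 0;
    while (x >>= 1)
        idx++;
    return idx;
}

/* Model of lzcnt: number of leading zero bits, 32 for x == 0. */
static unsigned lzcnt_model(uint32_t x)
{
    return x == 0 ? 32 : 31 - bsr_model(x);
}
```

For any non-zero input the `bsf` and `tzcnt` models agree, but for `x = 1` the `bsr` model gives 0 while the `lzcnt` model gives 31, so code that put a stray `rep` on `bsr` would change behavior on CPUs with `lzcnt`.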
> I'd like to enhance it using the Intel TSX prefixes, specifically XACQUIRE and XRELEASE.
Unfortunately, microcode updates have apparently disabled the HLE (Hardware Lock Elision) part of TSX on all Intel CPUs (perhaps to mitigate TAA side-channel attacks). This was the same update that made `jcc` at the end of a 32-byte block uncacheable in the uop cache, so it's hard to tell from benchmarking existing code what perf impact the no-HLE part has.
https://news.ycombinator.com/item?id=21533791 / Has Hardware Lock Elision gone forever due to Spectre Mitigation? (Yes, it's gone, but no, the reason probably isn't Spectre specifically. IDK if it will be back.)
If you want to use hardware transactional memory on x86, I think your only option is RTM (`xbegin`/`xend`), the other half of TSX. OSes can disable it, too, after the most recent microcode update; I'm not sure what the default is for typical systems, and this may change in the future, so this is something to check on before putting development time into anything.
There isn't AFAIK a way to use RTM but transparently fall back to locking; `xbegin` / `xend` are illegal instructions that fault with `#UD` if the CPUID feature bit isn't present.
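Since `xbegin` faults without the feature bit, any RTM code path has to be selected at runtime after a CPUID check. A minimal detection sketch, assuming a GCC- or clang-style `<cpuid.h>` that provides `__get_cpuid_count`; the `tsx_features` struct and `detect_tsx` name are hypothetical. It reads CPUID leaf 7, subleaf 0, where EBX bit 4 reports HLE and EBX bit 11 reports RTM, and reports both as absent on non-x86 targets.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/clang helper header (assumed available) */
#endif

/* Hypothetical result type for this sketch. */
typedef struct {
    bool hle;   /* CPUID.(EAX=7,ECX=0):EBX bit 4  */
    bool rtm;   /* CPUID.(EAX=7,ECX=0):EBX bit 11 */
} tsx_features;

static tsx_features detect_tsx(void)
{
    tsx_features f = { false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid_count returns 0 if leaf 7 is not supported at all. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.hle = (ebx >> 4) & 1u;
        f.rtm = (ebx >> 11) & 1u;
    }
#endif
    return f;
}
```

A caller would check `detect_tsx().rtm` once at startup and only then take a code path containing `xbegin` / `xend`, using an ordinary lock otherwise. Whether the bit is set depends on the CPU, microcode, and OS configuration, so this needs testing on the actual target systems.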
If you wanted transparent backwards compat, you were supposed to use HLE, so it's a real shame that it (and TSX in general) has had such a rough time, repeatedly getting disabled by microcode updates. (Previously in Haswell and Broadwell because of possible correctness bugs. It's turning into a Charlie Brown situation.)