
I have read through Intel's Software Developer's Manuals (vol. 1-3).

Without doing a similar read-through of AMD's Programming Guides (vol. 1-5), I am wondering which aspects of Intel's and AMD's programming models are the same.

Of course, even within a family of processors, there will be model-specific registers and support for various extensions and functionality.

However, Intel does make some general statements about simple things that I am unsure carry over to AMD. For example:

  • Cache line size
  • Memory order guarantees, per memory type
  • Atomic r/w guarantees, per memory type
  • etc.

Note, I am not asking about these examples specifically. I am asking whether, from the programmer's perspective, in terms of writing functionally equivalent code, the AMD and Intel programming models are equivalent.

(Only concerned here with AMD64 and Intel 64 architectures)

Hey Peter. Thank you for a detailed answer - not only answering the question, but providing very relevant examples. – Abel

1 Answer


In general, not quite: the programming model is not always exactly equivalent. You need to check both sets of docs if you want to be 100% sure.

https://en.wikipedia.org/wiki/X86-64#Differences_between_AMD64_and_Intel_64

e.g. bsf/bsr: Intel's docs say they leave the destination undefined on a zero source; AMD says they leave it unmodified. But in practice Intel does that too, with a microarchitectural dependency on the output register to go with it. This false dependency infected lzcnt/tzcnt as well until Skylake, and popcnt still, on Intel but not AMD. But until Intel gets around to putting it on paper that they're going to keep making their HW behave this way, compilers won't take advantage of it, and we probably shouldn't by hand either.

(Wikipedia seems to be saying that the upper 32 bits of the destination might be undefined, not zeroed, for bsr/bsf eax, ecx on Intel, though. So it's not strictly like always writing EAX. I can confirm this on SKL i7-6700k: mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not truncated to 2^32-1. But with non-zero ECX, writing EAX has the usual effect of zero-extending into RAX.)
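
If you want to reproduce that test yourself, here's a minimal sketch using GCC/Clang inline asm on x86-64 (my own wrapper, not from any manual). Remember that the zero-source behaviour is undefined per Intel's docs, so this only observes what one particular CPU happens to do:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t rax;
        uint32_t ecx = 0;                  // zero source operand
        asm("mov $-1, %0\n\t"              // set the full 64-bit dest to all-ones
            "bsf %1, %k0"                  // 32-bit bsf into the low half (%k0)
            : "=&r"(rax)                   // early-clobber: written before %1 is read
            : "r"(ecx));
        // Observed on SKL (and documented on AMD): destination unmodified for a
        // zero source, so this prints 0xffffffffffffffff, not a zero-extended value.
        printf("rax = %#llx\n", (unsigned long long)rax);
        return 0;
    }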


This is especially important for kernel code; privileged-instruction behaviour may have more subtle differences. I think TLB-invalidation semantics mostly match, e.g. it's guaranteed on both that you don't need to invalidate a TLB entry after changing an invalid entry to a valid one. Thus x86 disallows "negative caching", so an implementation that wanted to do that would have to snoop page-table stores for coherency.
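
As a sketch of what that guarantee buys you (hypothetical kernel-style C with invented names, not real kernel code), the invalid-to-valid transition needs no explicit invalidation:

    #include <stdint.h>

    #define PTE_PRESENT 0x1ull
    #define PTE_RW      0x2ull

    // Hypothetical page-table fragment; in a real kernel these entries live in
    // the paging structures that the hardware page walker reads.
    static uint64_t pte[512];

    void map_page(unsigned i, uint64_t phys) {
        // Entry was 0 (not present) before. Because x86 forbids negative TLB
        // caching, turning an invalid entry into a valid one requires no invlpg
        // or TLB shootdown on either Intel or AMD.
        pte[i] = phys | PTE_PRESENT | PTE_RW;
    }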

Some of this is probably unintentional, like Intel and AMD having different bugs for sysret with non-canonical x86-64 addresses, making it unsafe to use after a ptrace system call could have modified the saved RIP. A potential GP fault can happen in kernel mode after switching to the user stack, handing control of the kernel to another user-space thread from the same process that can modify that stack memory. (https://blog.xenproject.org/2012/06/13/the-intel-sysret-privilege-escalation/) That's why Linux always uses iret except for the common-case fast path where the saved registers are known to be clean. The comments in entry_64.S in the kernel source summarize it a bit.


Atomicity guarantees for unaligned cached loads/stores are weaker on AMD: boundaries as small as 8 bytes can matter on x86-64, because of AMD. "Why is integer assignment on a naturally aligned variable atomic on x86?" covers the common subset of that.
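
To sketch what that looks like in C (hypothetical types and names): natural alignment is what puts you in the common subset, while a packed layout that lets an 8-byte object straddle an 8-byte boundary is exactly where AMD's weaker guarantees could bite:

    #include <stdatomic.h>
    #include <stdint.h>

    // Hypothetical packed record (GCC/Clang attribute). The 64-bit member sits
    // at offset 1 and can straddle an 8-byte boundary; plain loads/stores of it
    // are not guaranteed atomic, and AMD's guarantees here are weaker than Intel's.
    struct __attribute__((packed)) record {
        char     tag;
        uint64_t unaligned_counter;
    };

    // Naturally aligned 8-byte object: plain aligned loads/stores are atomic on
    // both Intel and AMD. This is the common subset the linked Q&A covers.
    _Atomic uint64_t aligned_counter;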


Cache line size has never been officially standardized. In practice Intel and AMD CPUs use 64-byte lines, and this can be queried at runtime using CPUID the same way on both.
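
For example, a minimal sketch of that runtime query using GCC/Clang's <cpuid.h>: CPUID leaf 1 reports the CLFLUSH line size in EBX bits 15:8, in units of 8 bytes, with the same encoding on both vendors:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        // CPUID leaf 1: EBX bits 15:8 = CLFLUSH line size in 8-byte units.
        // Strictly this is the CLFLUSH granularity, but it matches the cache
        // line size on current Intel and AMD CPUs.
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            unsigned line_size = ((ebx >> 8) & 0xff) * 8;
            printf("cache line size: %u bytes\n", line_size);  // 64 in practice
        }
        return 0;
    }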


AFAIK, memory-ordering rules are identical for WB at least, and probably for other types including WC, and the interaction of LFENCE/SFENCE/MFENCE with lock add. Although it's not clearly documented by Intel whether lock and xchg are intended to be different from mfence. But you're asking about the programming model itself, not just what the docs say on paper. See "Does lock xchg have the same behavior as mfence?" and "What is the difference in logic and performance between LOCK XCHG and MOV+MFENCE?"
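
To connect that to code you'd actually write: a C11 seq_cst store is where compilers pick between those instruction sequences. A sketch, assuming GCC/Clang targeting x86-64:

    #include <stdatomic.h>

    _Atomic int flag;

    void publish(void) {
        // On x86-64 a seq_cst store is typically compiled as either
        //   mov $1, %eax ; xchg %eax, flag(%rip)    (xchg is implicitly locked)
        // or
        //   mov $1, flag(%rip) ; mfence
        // The Q&As linked above discuss whether those are truly interchangeable.
        atomic_store_explicit(&flag, 1, memory_order_seq_cst);
    }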

IDK about AMD, but NT WC loads might reorder with lock add / xchg on Intel (but they're not supposed to with MFENCE, I think, and that's why an Intel ucode update had to strengthen MFENCE on Skylake to block OoO exec like LFENCE's other effect, to prevent later loads from being in the pipe at all.) @Bee's answer on the first link mentions this, and see the bottom of this. When testing real hardware, it's always hard to tell what's future-guaranteed behaviour, and what's merely an implementation detail, and that's where the manuals come in.