4
votes

I'm interested in the sequentially consistent load operation on x86.

As far as I can see from the assembler listing generated by the compiler, it is implemented as a plain load on x86. However, as far as I know, plain loads are guaranteed to have acquire semantics, while plain stores are guaranteed to have release semantics.

A sequentially consistent store is implemented as a locked xchg, while the load is a plain load. That sounds strange to me; could you please explain this in detail?

added

I just found on the internet that a sequentially consistent atomic load can be done as a simple mov as long as the store is done with a locked xchg, but there was no proof and no links to documentation.

3 Answers

10
votes

A plain MOV on x86 is sufficient for an atomic sequentially consistent load, as long as SC stores are done with LOCKed instructions, the value is correctly aligned, and "normal" WB cache mode is used.
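One way to see why this mapping works is the classic store-buffering litmus test. The only reordering x86 permits on ordinary WB memory is a later load passing an earlier store to a different location, so making the SC store a locked instruction is enough and the load can stay a plain MOV. The following is a minimal C++ sketch of that test, not a proof; the function name `count_forbidden_outcomes` is hypothetical, and the instruction choices in the comments are what compilers commonly emit, not a guarantee.

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test (hypothetical helper, for illustration only).
// Returns how many of `iterations` runs ended with both threads reading 0,
// which is the one outcome sequential consistency forbids.
int count_forbidden_outcomes(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            // SC store: commonly XCHG (or MOV + MFENCE) on x86.
            x.store(1, std::memory_order_seq_cst);
            // SC load: commonly a plain MOV on x86.
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();  // join() makes r1 and r2 safely visible here

        if (r1 == 0 && r2 == 0) ++forbidden;
    }
    return forbidden;
}
```

With `memory_order_seq_cst` on both operations the count should always be zero. Weakening the stores to `memory_order_release` (a plain MOV on x86) allows the both-zero outcome, although a short run may not happen to observe it.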

See my blog post at http://www.justsoftwaresolutions.co.uk/threading/intel-memory-ordering-and-c++-memory-model.html for the full mapping, and the Intel processor docs at http://developer.intel.com/products/processor/manuals/index.htm for the details of the allowed orderings.

If you use "WC" cache mode or "non-temporal" instructions such as MOVNTI then all bets are off, as the processor doesn't necessarily write the data back to main memory in a timely manner.

2
votes

Reads on x86 are atomic by nature, as long as they are aligned. The section on the MOV instruction in the Intel assembly manuals, vol. 2A, should mention this, as should the section on the LOCK prefix. Other volumes may also mention it.

However, if you want to force an atomic read, you can use _InterlockedExchangeAdd((LONG*)&var, 0), aka LOCK XADD. This yields the old value without changing it. The same can be done with InterlockedCompareExchange((LONG*)&var, var, var), aka LOCK CMPXCHG, but IMO there is no need for this.
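For comparison, the same two tricks can be written portably with `std::atomic`. This is a sketch only: the helper names are hypothetical, and whether the compiler actually emits LOCK XADD or LOCK CMPXCHG is an assumption about typical x86 code generation, since an optimizer is free to choose another sequence.

```cpp
#include <atomic>

// Read by adding zero: returns the current value and leaves it unchanged.
// On x86 this may compile to LOCK XADD.
long read_via_fetch_add(std::atomic<long>& var) {
    return var.fetch_add(0, std::memory_order_seq_cst);
}

// Read by compare-and-swap: on x86 this may compile to LOCK CMPXCHG.
long read_via_cas(std::atomic<long>& var) {
    long expected = 0;  // arbitrary guess at the current value
    // If var == 0, the CAS writes 0 over 0 (no visible change).
    // Otherwise it fails and copies the current value into `expected`.
    var.compare_exchange_strong(expected, expected, std::memory_order_seq_cst);
    return expected;
}

// The ordinary sequentially consistent load, for reference.
long read_plain(const std::atomic<long>& var) {
    return var.load(std::memory_order_seq_cst);
}
```

All three return the same value when no other thread is writing; the first two just pay for a locked read-modify-write to get it.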

1
votes

Register-to-memory transfers and vice versa are not necessarily atomic in a multiprocessor environment.

READING

XOR EAX, EAX
LOCK XADD [address], EAX

The first instruction zeroes the EAX register. The second exchanges the contents of EAX and [address], then stores the sum of both back in [address]. Since EAX was zero beforehand, the value in memory is unchanged, and EAX ends up holding it.

WRITING

XCHG [address], EAX

Before the instruction, the EAX register holds the value to be stored at the specified address.
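In C++ terms, the write above corresponds roughly to `std::atomic::exchange`. The sketch below uses a hypothetical helper name, and the mapping to XCHG is an assumption about typical x86 code generation.

```cpp
#include <atomic>

// Hypothetical helper: stores `value` and returns what was there before,
// mirroring XCHG [address], EAX, which leaves the old contents in EAX.
int store_via_exchange(std::atomic<int>& target, int value) {
    return target.exchange(value, std::memory_order_seq_cst);
}
```

If the previous value is not needed, `target.store(value)` with the default sequentially consistent ordering expresses the same write.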

EDIT: LOCK ADD EAX, [address] will cause an "Invalid Opcode Exception" because the destination operand is not a memory address.

"An invalid-opcode exception (#UD) is generated when the LOCK prefix is used with any other instruction or when no write operation is made to memory." (Section 8.1.2.2, Software Controlled Bus Locking)

Edit 2: Summary of information from the comments.

While

"[...] the processor’s locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL."

there are restrictions to this:

"Accesses to cacheable memory that are split across bus widths, cache lines, and page boundaries are not guaranteed to be atomic"
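To make the restriction concrete, here is a small sketch that checks whether an access would straddle a cache line. The 64-byte line size is an assumption (common on current x86 parts, but not architecturally fixed), and both helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes; adjust for the actual target CPU.
constexpr std::uintptr_t kAssumedCacheLine = 64;

// True if an access of `size` bytes starting at `addr` touches more than
// one cache line, i.e. it is a split access with no atomicity guarantee.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 &&
           (addr / kAssumedCacheLine) != ((addr + size - 1) / kAssumedCacheLine);
}

// True if `addr` is a multiple of `size` (natural alignment).
// `size` must be a power of two.
constexpr bool is_naturally_aligned(std::uintptr_t addr, std::size_t size) {
    return size != 0 && (addr & (size - 1)) == 0;
}
```

A naturally aligned access of a power-of-two size no larger than the line never crosses a line boundary, which is why the answers above keep insisting on alignment.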