When can the CPU ignore the LOCK prefix and use cache coherency?

Question

I originally thought cache coherency protocols such as MESI can provide pseudo-atomicity, but only for individual memory load/store instructions. If I were performing a fetch-modify-write sequence of instructions, MESI alone wouldn't be able to enforce atomicity from the first instruction to the last.

However, section 8 of the Intel reference manual Vol 3a says:

8.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

This seems to contradict my understanding: does it imply that the LOCK prefix doesn't need to be used, because cache coherency can be used instead?

Cache coherence is an internal CPU mechanism. It is intended to manage such conflicts and maintain consistency between cache and memory. If only one thread accesses the cached memory, then there is no reason to lock the memory address. — GJ.
@GJ. What is "consistency" between cache and "memory"? (= RAM?) — curiousguy

Leeor · Accepted Answer · 2014-08-25T19:00:00

There's a difference between locking as a concept and the actual LOCK# bus signal: the latter is only one means of implementing the former. Cache locking is another, and it is much simpler and more efficient.

The MESI protocol guarantees that if a line is held exclusively by a certain core (in either the Exclusive or the Modified state), no other core has it. In this case you can perform multiple operations atomically by adding a simple flag in the cache that blocks external snoops until the operations are done. This has the same effect as the lock concept dictates, since no one else may change or even observe the intermediate values.

In more complicated cases, the line is not held by a single cache (e.g. it may be shared between several caches, or the access may be split across two cache lines and only one of them is in your cache; the list of scenarios is usually implementation-specific and probably not disclosed by the CPU manufacturer). In such cases you may have to resort to heavier tools like the bus lock, which usually guarantees no one else can do anything on the shared bus. Obviously this has a huge impact on performance, so it is probably only used when there is no other choice. In most cases a simple cache-level lock should be enough. Note that newer schemes like Intel TSX seem to work in a similar manner, offering optimizations when you're working from within the cache.
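To make the "split between two cache lines" case concrete, here is a minimal sketch of how one might check whether an access straddles a line boundary. The 64-byte line size and the helper name `straddles_cache_line` are my assumptions for illustration; the real line size is implementation-specific and should be queried on the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Common on current x86 parts,
// but not something the architecture guarantees.
constexpr std::size_t kAssumedLineSize = 64;

// Hypothetical helper: returns true if the `size` bytes starting at
// `addr` touch two different cache lines (a "split" access).
inline bool straddles_cache_line(std::uintptr_t addr, std::size_t size,
                                 std::size_t line_size = kAssumedLineSize) {
    assert(size > 0 && size <= line_size);
    const std::uintptr_t first_line = addr / line_size;              // line holding the first byte
    const std::uintptr_t last_line = (addr + size - 1) / line_size;  // line holding the last byte
    return first_line != last_line;
}
```

An 8-byte access at offset 56 ends exactly at the line boundary and stays within one line, while the same access at offset 60 spills into the next line and would fall into the split category described above.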

By the way, your assumption about pseudo-atomicity for individual instructions is also wrong. It would be correct if you referred to a single memory operation (a load or a store), but an instruction may include several of them (e.g. inc [addr] is not atomic without a lock prefix). Another restriction, which also appears in your quote, is that the access needs to be contained in a single cache line: split accesses don't guarantee atomicity even within a single load or store (since they're usually implemented as two memory operations that are later merged).
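The difference between individually atomic memory operations and an atomic read-modify-write can be sketched in C++. This is an illustration under assumptions, not a statement about any particular microarchitecture: the function names are made up, and the claim that `fetch_add` becomes a lock-prefixed instruction reflects what mainstream x86 compilers typically emit rather than anything the C++ standard requires.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each load and each store below is atomic on its own, but the
// load / add / store sequence is not: another thread can store in
// between, and that update is then overwritten (a lost update).
long racy_increments(int threads, int iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                long v = counter.load(std::memory_order_relaxed);
                counter.store(v + 1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}

// fetch_add is a single atomic read-modify-write. On x86 it is
// typically compiled to a lock-prefixed add or xadd (assumption
// about the compiler, not a language guarantee).
long locked_increments(int threads, int iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

With several threads, `locked_increments` always returns `threads * iters`, whereas `racy_increments` can return any smaller value depending on how the threads interleave, even though no single load or store was ever torn.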
