How do "acquire" and "consume" memory orders differ, and when is "consume" preferable?

Question

The C++11 standard defines a memory model (1.7, 1.10) with several memory orderings: roughly, "sequentially-consistent", "acquire", "consume", "release", and "relaxed". Equally roughly, a program is correct only if it is race-free, which holds if any two conflicting actions are ordered so that one happens-before the other. An action X happens-before an action Y if either X is sequenced before Y (within one thread), or X inter-thread-happens-before Y. The latter condition is given, among others, when

X synchronizes with Y, or
X is dependency-ordered before Y.

Synchronizing-with happens when X is an atomic store with "release" ordering on some atomic variable, and Y is an atomic load with "acquire" ordering on the same variable that reads the value written by X. Being dependency-ordered-before happens for the analogous situation where Y is a load with "consume" ordering (and a suitable memory access). The notion of synchronizes-with extends the happens-before relationship transitively across actions being sequenced-before one another within a thread, but being dependency-ordered-before is extended transitively only through a strict subset of sequenced-before called carries-dependency, which follows a largish set of rules, and notably can be interrupted with std::kill_dependency.
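To make the terminology above concrete, here is a minimal sketch of the two pairings. The names `Payload`, `slot`, `unrelated`, `producer`, `reader_acquire`, `reader_consume` and `run_demo` are made up for illustration. It only shows which reads each ordering covers; it says nothing about performance, and current compilers commonly implement `memory_order_consume` as `memory_order_acquire` anyway.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Payload { int value; };          // hypothetical payload type

std::atomic<Payload*> slot{nullptr};    // the atomic variable both sides use
int unrelated = 0;                      // plain data NOT reached through the pointer

void producer(Payload* p) {
    p->value = 42;                              // sequenced before the store
    unrelated = 7;                              // also sequenced before the store
    slot.store(p, std::memory_order_release);   // X: the release store
}

int reader_acquire() {
    Payload* p;
    while ((p = slot.load(std::memory_order_acquire)) == nullptr) {}
    // X synchronizes with this load, so every write sequenced before X is
    // visible here, including the one to `unrelated`.
    return p->value + unrelated;
}

int reader_consume() {
    Payload* p;
    while ((p = slot.load(std::memory_order_consume)) == nullptr) {}
    // The load carries a dependency into `p->value`, so this read is ordered
    // after the producer's write. Reading `unrelated` here would NOT be
    // ordered and would be a data race, so this function does not touch it.
    int v = p->value;
    // kill_dependency returns the same value but ends the dependency chain:
    // anything computed from the result is no longer dependency-ordered.
    return std::kill_dependency(v);
}

int run_demo() {
    Payload payload{0};
    unrelated = 0;
    slot.store(nullptr, std::memory_order_relaxed);   // reset before threads start
    int a = 0, c = 0;
    std::thread r1([&a] { a = reader_acquire(); });
    std::thread r2([&c] { c = reader_consume(); });
    std::thread w(producer, &payload);
    r1.join(); r2.join(); w.join();
    assert(a == 49 && c == 42);
    return a + c;
}
```

Calling `run_demo()` once from `main` exercises both readers. The interesting part is what `reader_consume` is not allowed to read, rather than the values returned.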

Now then, what is the purpose of the notion of "dependency ordering"? What advantage does it provide over the simpler sequenced-before / synchronizes-with ordering? Since the rules for it are stricter, I assume it can be implemented more efficiently.

Can you give an example of a program where switching from release/acquire to release/consume is both correct and provides a non-trivial advantage? And when would std::kill_dependency provide an improvement? High-level arguments would be nice, but bonus points for hardware-specific differences.

Disclaimer: I just watched Herb Sutter's atomic<> Weapons talks, and he said that he won't discuss "consume" because "nobody understands it". — Kerrek SB
"And when would std::kill_dependency provide an improvement?" Related: stackoverflow.com/q/14779518/420683 and stackoverflow.com/q/7150395/420683 ; also note cppreference claims "On all mainstream CPUs other than DEC Alpha, dependency ordering is automatic, no additional CPU instructions are issued for this synchronization mode[...]" whereas this doesn't hold for release-acquire ordering (I think an example is ARM). — dyp
@Damon: No, he said that nobody understands what it means and how to use it. It's one thing to have an abstract description, and another to have an intimate understanding of how it's used correctly and effectively. Would you agree that there are very few people who understand how to write lock-free code properly? And that's a much simpler problem. — Kerrek SB
For those reading here, one key detail is that consume is not transitive, meaning if T2 consumes T1's changes, and T3 consumes T2's changes, T3 MAY not see all of T1's changes! With acquire/release, this transitive behavior does work, and T3 would see T1's changes. For most developers, this is much more intuitive than consume. However, on a few VERY large computers (1024+ cores), the cost of synchronizing more memory than needed could be very great. Consume did a good job of matching what was needed in those cases. — Cort Ammon

Cubbi · Accepted Answer · 2013-10-31T18:29:25

Data dependency ordering was introduced by N2492 with the following rationale:

There are two significant use cases where the current working draft (N2461) does not support scalability near that possible on some existing hardware.

read access to rarely written concurrent data structures

Rarely written concurrent data structures are quite common, both in operating-system kernels and in server-style applications. Examples include data structures representing outside state (such as routing tables), software configuration (modules currently loaded), hardware configuration (storage device currently in use), and security policies (access control permissions, firewall rules). Read-to-write ratios well in excess of a billion to one are quite common.

publish-subscribe semantics for pointer-mediated publication

Much communication between threads is pointer-mediated, in which the producer publishes a pointer through which the consumer can access information. Access to that data is possible without full acquire semantics.

In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.

emphasis mine

The motivating use case presented there is rcu_dereference() from the Linux kernel.
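As a rough illustration of the pointer-mediated publication pattern from the rationale, here is a minimal C++ sketch loosely modelled on the idea behind rcu_dereference(). It is not the kernel's implementation. `RoutingTable`, `current_table`, `publish`, `lookup_default_route` and `demo_publish_and_read` are hypothetical names, and reclamation of old tables (which real RCU handles with grace periods) is deliberately left out.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical rarely-written structure (think: a routing table).
struct RoutingTable {
    int version;
    std::string default_route;
};

std::atomic<const RoutingTable*> current_table{nullptr};

// Rare writer path: fully build the table first, then publish the pointer.
// The caller must keep `fresh` alive for as long as readers may use it.
void publish(const RoutingTable* fresh) {
    current_table.store(fresh, std::memory_order_release);
}

// Hot reader path: only data reached *through* the loaded pointer is needed,
// so dependency ordering is enough and full acquire semantics are not required.
std::string lookup_default_route() {
    const RoutingTable* t = current_table.load(std::memory_order_consume);
    if (t == nullptr) {
        return std::string{};        // nothing published yet
    }
    return t->default_route;         // dependent read, ordered after publish()
}

// Single-threaded smoke test of the API shape (not a concurrency test).
bool demo_publish_and_read() {
    static const RoutingTable table{1, "10.0.0.1"};
    current_table.store(nullptr, std::memory_order_relaxed);
    assert(lookup_default_route().empty());
    publish(&table);
    return lookup_default_route() == "10.0.0.1";
}
```

On hardware that orders dependent loads automatically, the reader path would in principle need no fence at all, which is where the speedups quoted above come from. In practice, compilers commonly treat `memory_order_consume` as `memory_order_acquire`, so the sketch should be read as showing the intent, with no guarantee of a measurable gain.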
