First, some context: I'm working with a pre-C11, inline-asm-based atomic model, but for the purposes of this question I'm happy to ignore the C aspect (and any compiler barrier issues, which I can deal with separately) and consider it essentially just an asm/cpu-architecture question.
Suppose I have code that looks like:
various stores
barrier
store flag
barrier
I want to be able to read flag from another cpu core and conclude that the various stores were already performed and made visible. Is it possible to do so without any kind of memory barrier instruction on the loading side? Clearly it's possible at least on some cpu architectures, for example x86 where an explicit memory barrier is not needed on either core. But what about in general? Does it vary widely by cpu arch whether this is possible?
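To make the pattern concrete, here is a minimal sketch in C using GCC/Clang builtins and pthreads. The names `full_barrier`, `compiler_barrier`, `read_barrier` and `run_handoff` are made up for illustration. It assumes `__sync_synchronize()` is an acceptable stand-in for the full barrier, and it only drops the reader-side hardware barrier on x86, which is the one case the question already takes as known; everywhere else it conservatively keeps a full barrier.

```c
#include <assert.h>
#include <pthread.h>

/* Full hardware + compiler barrier (GCC/Clang builtin that predates C11). */
#define full_barrier() __sync_synchronize()
/* Compiler-only barrier: blocks compiler reordering, emits no instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical reader-side barrier: no instruction on x86, where loads are
 * not reordered with other loads; a conservative full barrier elsewhere. */
#if defined(__i386__) || defined(__x86_64__)
#define read_barrier() compiler_barrier()
#else
#define read_barrier() full_barrier()
#endif

static int data[2];        /* the "various stores" */
static volatile int flag;  /* the flag the reader polls */

static void *writer(void *arg)
{
    (void)arg;
    data[0] = 40;          /* various stores */
    data[1] = 2;
    full_barrier();        /* barrier */
    flag = 1;              /* store flag */
    full_barrier();        /* barrier */
    return 0;
}

/* Starts the writer, spins until flag is set, then reads the data. */
int run_handoff(void)
{
    pthread_t t;
    int sum;
    int rc = pthread_create(&t, 0, writer, 0);

    assert(rc == 0);
    (void)rc;
    while (!flag)
        ;                  /* spin on the volatile flag */
    read_barrier();        /* the barrier the question hopes to avoid */
    sum = data[0] + data[1];
    pthread_join(t, 0);
    return sum;
}
```

Calling `run_handoff()` from `main` (built with `-pthread`) should return 42 once the reader has seen `flag`. Whether `read_barrier()` can be reduced to a compiler barrier on architectures other than x86 is exactly what is being asked.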
smp_read_barrier_depends() in the Linux kernel is only a barrier on Alpha, so it seems that if there is a (possibly fake) address dependency on the reading side, the read barrier can be avoided (except on Alpha). Making the compiler preserve the dependency is a whole other issue. – ninjalj
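A rough sketch of what the address-dependency idea in this comment could look like: instead of a flag, the writer publishes a pointer, and the reader's data loads take their address from the loaded pointer. The names `payload`, `slot`, `published`, `publisher` and `run_dependent_read` are hypothetical, and the sketch assumes the compiler keeps the dependency intact (the `volatile` pointer load makes that likely in practice but does not guarantee it).

```c
#include <assert.h>
#include <pthread.h>

struct payload { int a; int b; };

static struct payload slot;
/* NULL until the writer publishes; volatile so each poll really reloads it. */
static struct payload *volatile published;

static void *publisher(void *arg)
{
    (void)arg;
    slot.a = 40;                /* various stores */
    slot.b = 2;
    __sync_synchronize();       /* writer-side barrier */
    published = &slot;          /* publish a pointer instead of a flag */
    return 0;
}

/* Starts the publisher, waits for the pointer, then reads through it. */
int run_dependent_read(void)
{
    pthread_t t;
    struct payload *p;
    int sum;
    int rc = pthread_create(&t, 0, publisher, 0);

    assert(rc == 0);
    (void)rc;
    while ((p = published) == 0)
        ;                       /* spin until the pointer appears */
#if defined(__alpha__)
    __sync_synchronize();       /* Alpha does not order dependent loads */
#endif
    /* These loads use the address just loaded into p, so they carry an
     * address dependency on the pointer load. This relies on the compiler
     * not replacing p with &slot, which is an assumption. */
    sum = p->a + p->b;
    pthread_join(t, 0);
    return sum;
}
```

On x86 this is ordered anyway. On architectures that honor address dependencies, the intent is that no reader-side barrier instruction is emitted; only the Alpha branch adds one.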