Imagine code that releases a shared pointer:
auto tmp = &(the_ptr->a);
*tmp = 10;
the_ptr.dec_ref();
If dec_ref() does not have "release" semantics, the compiler (or CPU) is free to move operations from before dec_ref() to after it, for example:
auto tmp = &(the_ptr->a);
the_ptr.dec_ref();
*tmp = 10;
This is not safe, because dec_ref() can also be called from another thread at the same time and delete the object, so the write through tmp may land in freed memory.
So dec_ref() must have "release" semantics, to make everything that comes before it stay before it.
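The release half on its own can be sketched as follows. This is only an illustration: ref_count is a hypothetical name, and std::atomic's fetch_sub returns the value held before the subtraction, so the check is "== 1" rather than the "0 ==" used with atomic_sub elsewhere in this answer. Deleting the object is deliberately left out, because a release-only decrement is not yet enough for that.

```cpp
#include <atomic>

// Hypothetical counter wrapper; the name and layout are illustrative only.
struct ref_count {
    std::atomic<int> value{1};

    // Returns true when this call dropped the last reference.
    bool dec_ref() {
        // Release ordering: reads and writes sequenced before this call
        // cannot be reordered after the decrement.
        // fetch_sub returns the value held *before* the subtraction.
        return value.fetch_sub(1, std::memory_order_release) == 1;
    }
};
```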
Let's now imagine that the object's destructor looks like this:
~object() {
    auto xxx = a;
    printf("%i\n", xxx);
}
Let's also modify the example a bit so that there are two threads:
// thread 1
auto tmp = &(the_ptr->a);
*tmp = 10;
the_ptr.dec_ref();
// thread 2
the_ptr.dec_ref();
Then the "aggregated" code, with dec_ref() and the destructor inlined, will look like this:
// thread 1
auto tmp = &(the_ptr->a);
*tmp = 10;
{ // the_ptr.dec_ref();
    if (0 == atomic_sub(...)) {
        { // ~object()
            auto xxx = a;
            printf("%i\n", xxx);
        }
    }
}
// thread 2
{ // the_ptr.dec_ref();
    if (0 == atomic_sub(...)) {
        { // ~object()
            auto xxx = a;
            printf("%i\n", xxx);
        }
    }
}
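However, "release" only keeps earlier operations from sinking below atomic_sub(); it does not keep later operations from being hoisted above it. So if atomic_sub() has only "release" semantics, this code can be optimized like this: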
// thread 2
auto xxx = the_ptr->a; // "auto xxx = a;" from the destructor moved here
{ // the_ptr.dec_ref();
    if (0 == atomic_sub(...)) {
        { // ~object()
            printf("%i\n", xxx);
        }
    }
}
But then the destructor will not always print the last value of "a": thread 2 may read "a" before thread 1 writes 10 to it, so the code is no longer race-free. That's why atomic_sub() also needs "acquire" semantics (or, strictly speaking, we need an acquire barrier when the counter becomes 0 after the decrement).
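Putting both halves together, a minimal sketch in standard C++ could look like the code below. The object layout, the observed pointer and run_example() are assumptions made for illustration, not part of any real shared pointer implementation. The decrement is a release operation, and the acquire barrier is issued only on the path where the counter reaches zero. Note that fetch_sub returns the previous value, so the last reference is detected with "== 1".

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Hypothetical intrusive ref-counted object; all names are illustrative.
struct object {
    int a = 0;
    std::atomic<int> refs{2};  // one reference per thread in this sketch
    int* observed;             // lets the caller see what the destructor read

    explicit object(int* out) : observed(out) {}

    ~object() {
        auto xxx = a;
        *observed = xxx;
        std::printf("%i\n", xxx);
    }
};

void dec_ref(object* p) {
    // Release: writes made before the decrement cannot sink below it.
    if (p->refs.fetch_sub(1, std::memory_order_release) == 1) {
        // Acquire barrier, paid only by the thread that dropped the last
        // reference: reads in the destructor cannot be hoisted above the
        // decrement, and they see the writes released by the other thread.
        std::atomic_thread_fence(std::memory_order_acquire);
        delete p;
    }
}

// Runs the two-thread scenario and returns the value the destructor saw.
int run_example() {
    int seen = -1;
    object* p = new object(&seen);
    std::thread t1([p] { p->a = 10; dec_ref(p); });  // thread 1
    std::thread t2([p] { dec_ref(p); });             // thread 2
    t1.join();
    t2.join();
    return seen;
}
```

Whichever thread ends up running the destructor, run_example() returns 10. Using fetch_sub(1, std::memory_order_acq_rel) without the separate fence would also be correct; the fence variant simply avoids the acquire cost on decrements that do not reach zero.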