TL;DR: I understand that MOVNTI stores are not ordered relative to the rest of the program, so SFENCE/MFENCE is needed. But are MOVNTI stores also unordered relative to other MOVNTI stores from the same thread?
Suppose I have a producer-consumer queue and want to use MOVNTI on the producer side to avoid cache pollution.
(I have not actually observed any cache pollution effect yet, so this is probably a theoretical question for now.)
So I'm replacing the following producer:
std::atomic<std::size_t> producer_index;
QueueElement queue_data[MAX_SIZE];
...
void producer()
{
    for (;;)
    {
        ...
        queue_data[i].d1 = v1;
        queue_data[i].d2 = v2;
        ...
        queue_data[i].dN = vN;
        producer_index.store(i, std::memory_order_release);
    }
}
With the following:
void producer()
{
    for (;;)
    {
        ...
        _mm_stream_si64(&queue_data[i].d1, v1);
        _mm_stream_si64(&queue_data[i].d2, v2);
        ...
        _mm_stream_si64(&queue_data[i].dN, vN);
        _mm_sfence();
        producer_index.store(i, std::memory_order_release);
    }
}
Notice that I added _mm_sfence, which makes sure the results of the non-temporal stores become observable before the following store to producer_index does.
Without it, the consumer may observe the new producer_index before the queue_data changes.
But what if I write the index with _mm_stream_si64 too?
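For reference, a minimal self-contained sketch of the SFENCE variant together with a matching consumer might look like the following. The QueueElement layout, the MAX_SIZE value, and the produce/consume function names are assumptions made up for illustration. Unlike the snippet above, producer_index here holds i + 1 (the number of published elements) so that the consumer can tell when slot 0 is ready. It assumes x86-64 and no wrap-around.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <immintrin.h>  // _mm_stream_si64, _mm_sfence (x86-64 only)

// Hypothetical element layout; the real QueueElement is not shown in the question.
struct QueueElement
{
    long long d1;
    long long d2;
};

constexpr std::size_t MAX_SIZE = 1024;  // assumed capacity

QueueElement queue_data[MAX_SIZE];
std::atomic<std::size_t> producer_index{0};  // number of published elements

// Producer: writes `count` elements with non-temporal stores and publishes each one.
void produce(std::size_t count)
{
    assert(count <= MAX_SIZE);  // this sketch does not handle wrap-around
    for (std::size_t i = 0; i < count; ++i)
    {
        _mm_stream_si64(&queue_data[i].d1, static_cast<long long>(i));
        _mm_stream_si64(&queue_data[i].d2, static_cast<long long>(2 * i));
        // Order the non-temporal stores before the index store below.
        _mm_sfence();
        producer_index.store(i + 1, std::memory_order_release);
    }
}

// Consumer: waits for each element to be published, then checks its contents.
// Returns true if every element held the values written by produce().
bool consume(std::size_t count)
{
    bool ok = true;
    for (std::size_t i = 0; i < count; ++i)
    {
        while (producer_index.load(std::memory_order_acquire) <= i)
        {
            // spin until element i has been published
        }
        ok = ok && queue_data[i].d1 == static_cast<long long>(i)
                && queue_data[i].d2 == static_cast<long long>(2 * i);
    }
    return ok;
}
```

To exercise it concurrently, produce and consume would each run on their own thread (for example via std::thread) with the same count.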
std::size_t producer_index_value;
std::atomic_ref<std::size_t> producer_index { producer_index_value };
void producer()
{
    for (;;)
    {
        ...
        _mm_stream_si64(&queue_data[i].d1, v1);
        _mm_stream_si64(&queue_data[i].d2, v2);
        ...
        _mm_stream_si64(&queue_data[i].dN, vN);
        _mm_stream_si64(&producer_index_value, i);
    }
}
According to my reading of the Intel manuals, this shouldn't work, as non-temporal stores have relaxed ordering.
But did they say "relaxed" only to mean that non-temporal stores are not ordered against the rest of the program?
Maybe they are ordered among themselves, so the producer would still work as expected?
And if MOVNTI is truly relaxed, so that the last version is incorrect, what causes the memory writes to be reordered?