Are MOVNTI stores reordered relative to other MOVNTI stores made by the same thread?

Question

TL;DR: I understand that MOVNTI stores are not ordered relative to the rest of the program, so SFENCE/MFENCE is needed. But are MOVNTI stores also unordered relative to other MOVNTI stores made by the same thread?

Suppose I have a producer-consumer queue, and I want to use MOVNTI on the producer side to avoid cache pollution.

(I have not actually observed any cache pollution effect yet, so this is probably a theoretical question for now.)

So I'm replacing the following producer:

std::atomic<std::size_t> producer_index;
QueueElement queue_data[MAX_SIZE];
...
void producer()
{
    for (;;)
    {
        ...

        queue_data[i].d1 = v1;
        queue_data[i].d2 = v2;
        ...
        queue_data[i].dN = vN;

        producer_index.store(i, std::memory_order_release);
    }
}

With the following:

void producer()
{
    for (;;)
    {
        ...

        _mm_stream_si64(&queue_data[i].d1, v1);
        _mm_stream_si64(&queue_data[i].d2, v2);
        ...
        _mm_stream_si64(&queue_data[i].dN, vN);

        _mm_sfence();

        producer_index.store(i, std::memory_order_release);
    }
}

Notice that I added _mm_sfence, which waits until the results of the non-temporal stores become observable. If I don't add it, the consumer may observe the new producer_index before the changes to queue_data.

But what if I write the index with _mm_stream_si64 too?

std::size_t producer_index_value;
std::atomic_ref<std::size_t> producer_index { producer_index_value };

void producer()
{
    for (;;)
    {
        ...

        _mm_stream_si64(&queue_data[i].d1, v1);
        _mm_stream_si64(&queue_data[i].d2, v2);
        ...
        _mm_stream_si64(&queue_data[i].dN, vN);

        _mm_stream_si64(&producer_index_value, i);
    }
}

According to my reading of the Intel manuals, this shouldn't work, as non-temporal stores have relaxed ordering.

But didn't they say "relaxed" only to mean that non-temporal stores are not ordered against the rest of the program? Maybe they are ordered among themselves, so the producer would still work as expected?

And if MOVNTI is truly relaxed, so that the last version of the code is incorrect, what is the reason for the memory writes to be reordered?

Peter Cordes · Accepted Answer · 2020-06-06T18:43:05

movnti stores are weakly ordered relative to each other as well. In asm you definitely need sfence after storing the data to get release semantics for the store to producer_index, whether you do that with movnti or a plain mov store.

Without the fence it might happen to work most of the time, because the separate index store often wouldn't become visible to other threads until after the full-line writes done with NT stores. That is likely, in fact: completing a cache line triggers a flush of the WC buffer to DRAM (bypassing / evicting cache), but the index store will definitely not be a full-line store unless it happens to be contiguous with the end of the data written. None of that is guaranteed, though.

In C++ that means using _mm_sfence() before whatever you do to store to producer_index.
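
A minimal sketch of that pattern, assuming an x86-64 target and a hypothetical three-field QueueElement (the question elides the real layout). The helper names nt_store, nt_fence and publish are illustrative, not from the question; on other targets the helpers fall back to plain stores so the sketch still compiles.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

// Hypothetical element layout standing in for d1..dN from the question.
struct QueueElement {
    long long d1, d2, d3;
};

constexpr std::size_t MAX_SIZE = 16;
QueueElement queue_data[MAX_SIZE];
std::atomic<std::size_t> producer_index{0};

// Non-temporal 64-bit store (movnti) where available, plain store otherwise.
inline void nt_store(long long* p, long long v) {
#if HAVE_NT_STORES
    _mm_stream_si64(p, v);
#else
    *p = v;
#endif
}

// Orders all earlier stores, NT stores included, before any later store.
inline void nt_fence() {
#if HAVE_NT_STORES
    _mm_sfence();
#endif
}

void publish(std::size_t i, long long v1, long long v2, long long v3) {
    assert(i < MAX_SIZE);
    nt_store(&queue_data[i].d1, v1);
    nt_store(&queue_data[i].d2, v2);
    nt_store(&queue_data[i].d3, v3);

    nt_fence();  // required: NT stores are weakly ordered, even among themselves

    // Ordinary release store for the control variable, not an NT store.
    producer_index.store(i, std::memory_order_release);
}
```

A consumer would pair this with producer_index.load(std::memory_order_acquire) before reading queue_data[i]. Note that _mm_sfence is outside the ISO C++ memory model, so this relies on how x86 compilers treat the intrinsic in practice.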

Note that using movnti for a single scalar is a really bad idea: it forces the cache line to be evicted from cache, so the reader has to fetch it all the way from DRAM. That is, it will increase inter-core latency for that control variable, which otherwise would probably hit in L3.

Only use NT stores when you expect to complete a whole cache line, and when you don't expect another thread to be reloading the data soon.
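
As a rough illustration of "complete a whole cache line", here is a sketch that fills one aligned line entirely with NT stores. It assumes 64-byte cache lines (the case on current x86-64 CPUs); Line, kLineSize, kWords and write_line_nt are hypothetical names, and non-x86-64 targets fall back to plain stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

// Assumption: 64-byte cache lines.
constexpr std::size_t kLineSize = 64;
constexpr std::size_t kWords = kLineSize / sizeof(long long);

// Hypothetical payload that occupies exactly one aligned cache line.
struct alignas(kLineSize) Line {
    long long w[kWords];
};
static_assert(sizeof(Line) == kLineSize, "Line must fill exactly one cache line");

// Writes every word of the line with NT stores, so the write-combining
// buffer holds a complete line, then fences before returning.
void write_line_nt(Line& dst, const long long (&src)[kWords]) {
    for (std::size_t k = 0; k < kWords; ++k) {
#if HAVE_NT_STORES
        _mm_stream_si64(&dst.w[k], src[k]);
#else
        dst.w[k] = src[k];
#endif
    }
#if HAVE_NT_STORES
    _mm_sfence();  // make the line's stores ordered before anything stored later
#endif
}
```

The index or flag that tells the consumer the line is ready would still be an ordinary std::atomic store after this call, so that the control variable stays in cache.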
