I have a CUDA kernel of the following form:
__global__ void launch_kernel(/* ...arguments... */)
{
    int i = threadIdx.x;

    // Load the required data
    int temp1 = A[i];
    int temp2 = A[i+1];
    int temp3 = A[i+2];

    // Compute step
    int output1 = temp1 + temp2 + temp3;
    int output2 = temp1 + temp3;

    // Store the result
    B[i] = output1;
    C[i] = output2;
}
As discussed in the CUDA manual, the consistency model for GPU global memory is not sequential. As a result, memory operations may appear to be performed in an order different from the original program order. To enforce memory ordering, CUDA offers the __threadfence() family of functions. However, as per the manual, these functions only enforce ordering among writes relative to other writes (and, separately, among reads relative to other reads). Quoting a line from the manual:
All writes to shared and global memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to shared memory and global memory made by the calling thread after the call to __threadfence_block();
So it appears that __threadfence() is insufficient to enforce ordering between reads and writes.
How do I enforce ordering across reads and writes to global memory? Alternatively, how do I make sure that all the reads are guaranteed to complete before the compute and store sections of the kernel above execute?
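For illustration, a sketch of the kernel with the intra-thread dependency chain spelled out, and of where a fence would go if ordering visible to *other* threads were the actual concern. This assumes A, B, and C are global-memory int arrays passed as parameters; the __threadfence() call is an illustrative addition, not part of the original kernel:

```
__global__ void launch_kernel(const int *A, int *B, int *C)
{
    int i = threadIdx.x;

    // These loads feed the arithmetic below, so within this single
    // thread they must complete before the sums are formed: the
    // hardware and compiler always respect data dependencies.
    int temp1 = A[i];
    int temp2 = A[i + 1];
    int temp3 = A[i + 2];

    int output1 = temp1 + temp2 + temp3;
    int output2 = temp1 + temp3;

    B[i] = output1;

    // A fence is only needed if another thread must observe the
    // write to B[i] before the write to C[i]; __threadfence()
    // orders this thread's prior writes before its later ones
    // as seen by all other threads on the device.
    __threadfence();

    C[i] = output2;
}
```

The fence does not speed anything up or "flush" the loads; it only constrains the order in which this thread's memory operations become visible to others.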
temp1, temp2, and temp3 will properly occur before they are used to compute the intermediate quantities output1 and output2. Likewise for B[i], C[i]. – Robert Crovella