I have a CUDA kernel of the following form:
__global__ void launch_kernel(/* ...arguments... */)
{
    int i = threadIdx.x;

    // Load the required data
    int temp1 = A[i];
    int temp2 = A[i+1];
    int temp3 = A[i+2];

    // Compute step
    int output1 = temp1 + temp2 + temp3;
    int output2 = temp1 + temp3;

    // Store the result
    B[i] = output1;
    C[i] = output2;
}
As discussed in the CUDA manual, the consistency model for GPU global memory is not sequential. As a result, memory operations may appear to be performed in an order different from the original program order. To enforce memory ordering, CUDA offers the __threadfence() family of functions. However, as per the manual, these functions only enforce ordering among writes relative to other writes (and, separately, among reads relative to other reads). Quoting a line from the manual:
All writes to shared and global memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to shared memory and global memory made by the calling thread after the call to __threadfence_block();
So it appears that __threadfence() is insufficient to enforce ordering between reads and writes.
How do I enforce ordering across reads and writes to global memory? Alternatively, how do I make sure that all the reads are guaranteed to complete before the compute and store sections of the kernel above execute?
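For illustration, a sketch of the kernel with the intra-thread dependency chain spelled out, and of where a fence would go if ordering visible to *other* threads were the actual concern. This assumes A, B, and C are global-memory int arrays passed as parameters; the __threadfence() call is an illustrative addition, not part of the original kernel:

```
__global__ void launch_kernel(const int *A, int *B, int *C)
{
    int i = threadIdx.x;

    // These loads feed the arithmetic below, so within this single
    // thread they must complete before the sums are formed: the
    // hardware and compiler always respect data dependencies.
    int temp1 = A[i];
    int temp2 = A[i + 1];
    int temp3 = A[i + 2];

    int output1 = temp1 + temp2 + temp3;
    int output2 = temp1 + temp3;

    B[i] = output1;

    // A fence is only needed if another thread must observe the
    // write to B[i] before the write to C[i]; __threadfence()
    // orders this thread's prior writes before its later ones
    // as seen by all other threads on the device.
    __threadfence();

    C[i] = output2;
}
```

The fence does not speed anything up or "flush" the loads; it only constrains the order in which this thread's memory operations become visible to others.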
temp1, temp2, and temp3 will properly occur before they are used to compute the intermediate quantities output1 and output2. Likewise for B[i], C[i]. – Robert Crovella