What about clEnqueueWriteBuffer (non-blocking) and
clEnqueueNDRangeKernel enqueued after, which uses that buffer
contents?
Since it is an in-order queue, it will do the write first and run the kernel only after the write finishes, even though the write call itself is non-blocking.
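A minimal sketch of that pattern (queue, buffer and kernel creation assumed done elsewhere; names hypothetical, error checking omitted):

    #include <CL/cl.h>

    /* In-order queue: the non-blocking write completes before the
       kernel enqueued after it starts executing. */
    void write_then_run(cl_command_queue queue, cl_mem buf,
                        const float *host_data, size_t n, cl_kernel kernel)
    {
        size_t global = n;

        /* CL_FALSE = non-blocking: the call returns immediately, but the
           command still runs in submission order on an in-order queue.
           host_data must stay valid until the write actually completes. */
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                             host_data, 0, NULL, NULL);

        /* No event wait list needed: the in-order queue guarantees the
           kernel starts only after the write has finished. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);
    }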
clEnqueueBarrier is a device-side synchronization command and is intended for out-of-order queues. When you use clFinish(), you pay extra for the communication between host and device. Enqueueing a barrier is a much faster synchronization, but it works on the device side only. When you need to synchronize a queue with another queue and still need a similar sync point, use clEnqueueWaitForEvents just after (or before) the barrier, or simply use only the event waiting (for an in-order queue).
In OpenCL 1.2, clEnqueueWaitForEvents and clEnqueueBarrier were merged into clEnqueueBarrierWithWaitList, which lets you both place a barrier in an out-of-order queue and synchronize it with other queues or even host-side user events.
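A sketch of that 1.2-style pattern (queueA and queueB assumed to share one context; hypothetical names, error checking omitted):

    #include <CL/cl.h>

    /* kernelB in queueB must not run before kernelA's result is ready. */
    void barrier_with_wait_list_example(cl_command_queue queueA,
                                        cl_command_queue queueB,
                                        cl_kernel kernelA, cl_kernel kernelB,
                                        size_t n)
    {
        cl_event done_in_a;
        clEnqueueNDRangeKernel(queueA, kernelA, 1, NULL, &n, NULL,
                               0, NULL, &done_in_a);
        clFlush(queueA);            /* make sure queueA's work is issued */

        /* Device-side sync point: everything enqueued to queueB after
           this barrier waits for earlier queueB commands AND done_in_a. */
        clEnqueueBarrierWithWaitList(queueB, 1, &done_in_a, NULL);
        clEnqueueNDRangeKernel(queueB, kernelB, 1, NULL, &n, NULL,
                               0, NULL, NULL);
        clReleaseEvent(done_in_a);
    }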
If there is only a single in-order queue, you don't need a barrier, and when you need to synchronize with the host, you can use clFinish or an event-based synchronization command.
Or do I get memory synchronization out-of-the-box for all the commands in an in-order queue?
For enqueue-type commands only, yes. Enqueue (1 write + 1 compute + 1 read) 128 times in an in-order queue and they will all run one after another, completing a 128-step simulation (once they are issued by a flush/finish command). The commands don't have to follow a specific pattern for this implicit synchronization; anything like 1 write + 2 reads + 2 kernels + 5 writes + 1 read + 1 kernel + 15 reads runs one after another (2 kernels = 1 kernel + 1 kernel).
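The 128-step case could look like this (a sketch; the step kernel and host_in/host_out names are made up for illustration):

    #include <CL/cl.h>

    /* q is an in-order queue; error checking omitted. */
    void simulate_128_steps(cl_command_queue q, cl_mem buf, cl_kernel step,
                            float *host_in, float *host_out, size_t n)
    {
        size_t bytes = n * sizeof(float);
        for (int i = 0; i < 128; ++i) {
            /* write + compute + read: no events, no barriers; the
               in-order queue runs each command only after the previous
               one completes */
            clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes, host_in,
                                 0, NULL, NULL);
            clEnqueueNDRangeKernel(q, step, 1, NULL, &n, NULL,
                                   0, NULL, NULL);
            clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, host_out,
                                0, NULL, NULL);
        }
        clFinish(q);  /* issue everything and wait; host_out is valid now */
    }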
For non-enqueue-type commands such as clSetKernelArg, you have to use a synchronization point, or do them before enqueuing any commands.
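For example, a safe ordering might look like this (hypothetical names; error checking omitted):

    #include <CL/cl.h>

    void launch_with_args(cl_command_queue q, cl_kernel k,
                          cl_mem in, cl_mem out, size_t n)
    {
        /* clSetKernelArg is a host-side call, not an enqueued command;
           it has no place in the queue's ordering, so do it before the
           enqueue that should see these values. */
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    }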
You can also use enqueued commands themselves as inter-queue sync points: pass events through a command's event_wait_list parameter, and use the last (event) parameter to get its completion event to be waited on in another queue (signaling). But this is still not a barrier for an out-of-order queue.
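A sketch of that signaling pattern, with a write's completion event gating a single kernel in another queue (hypothetical names):

    #include <CL/cl.h>

    void signal_across_queues(cl_command_queue writer_q,
                              cl_command_queue compute_q,
                              cl_mem buf, const float *src, size_t n,
                              cl_kernel k)
    {
        cl_event write_done;
        clEnqueueWriteBuffer(writer_q, buf, CL_FALSE, 0, n * sizeof(float),
                             src, 0, NULL, &write_done);
        clFlush(writer_q);  /* issue the write so the event can complete */

        /* Only this kernel waits on the event; if compute_q is
           out-of-order, its other commands can still reorder freely,
           so this is signaling, not a barrier. */
        clEnqueueNDRangeKernel(compute_q, k, 1, NULL, &n, NULL,
                               1, &write_done, NULL);
        clReleaseEvent(write_done);
    }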
If a buffer is used by two kernels in different queues and both write to that buffer, there must be synchronization between the queues, unless they write to different locations. So you can have 20 kernels each working on 1/20th of a buffer, run them in parallel using multiple queues, and synchronize all the queues only at the end using a wait list. If a kernel uses or alters another kernel's region concurrently, it is undefined behaviour. A similar approach works for map/unmap too.
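A sketch of that partitioning idea using global work offsets (the queues are assumed to share one context and device; chunk size assumed to divide evenly):

    #include <CL/cl.h>

    #define NQUEUES 20

    /* Each queue's kernel covers its own 1/20th of the index space,
       so the written regions never overlap. */
    void partitioned_run(cl_command_queue queues[NQUEUES], cl_kernel k,
                         size_t total_items)
    {
        cl_event done[NQUEUES];
        size_t chunk = total_items / NQUEUES;

        for (int i = 0; i < NQUEUES; ++i) {
            size_t offset = (size_t)i * chunk;
            /* no inter-queue sync while running: disjoint regions */
            clEnqueueNDRangeKernel(queues[i], k, 1, &offset, &chunk, NULL,
                                   0, NULL, &done[i]);
            clFlush(queues[i]);
        }

        /* single sync point at the very end, via the collected wait list */
        clWaitForEvents(NQUEUES, done);
        for (int i = 0; i < NQUEUES; ++i)
            clReleaseEvent(done[i]);
    }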
in-order vs out-of-order example:
r: read, w: write, c: compute
                    <------------clFinish----------------------->
in-order queue....: rwrwrwcwccwccwrcwccccccwrcwcwrwcwrwccccwrwcrw
out-of-order queue: <--r--><-r-><-c-><-----c-----><-----r-----><w>
                    <---w-------><-------r-----><------c----->
                    <---r---><-----c--------------------><--c->
                    <---w--->
                    <---c----->            <----w------>
and another out-of-order queue with a barrier in the middle:
<---r---><--w--->   |  <-----w---->
<---c------>        |  <----c--->  <----w--->
<---c-------->      |  <------r------->
<----w------>       |  <----c------->
where the read/write operations before the barrier are forced to wait until all commands reach the same barrier; then all the remaining ones continue concurrently.
The last example shows that memory visibility from the "host side" can be acquired by a barrier or by clFinish. But a barrier doesn't inform the host that it has finished, so you need to query events about the queue. clFinish blocks until all commands are finished, so you don't need to query anything. Both will make the host see the most up-to-date memory.
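A sketch of the event-query route (the event comes from clEnqueueBarrierWithWaitList's last parameter; remember to clFlush the queue so the barrier gets issued):

    #include <CL/cl.h>

    /* Returns nonzero once everything up to the barrier has completed.
       Poll this instead of blocking the host in clFinish. */
    int barrier_completed(cl_event barrier_event)
    {
        cl_int status;
        clGetEventInfo(barrier_event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        return status == CL_COMPLETE;
    }

clWaitForEvents(1, &barrier_event) or clSetEventCallback are the blocking and callback alternatives to polling.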
Your question is about memory visibility for commands of an in-order queue, so you don't need a synchronization point for them to see each other's most up-to-date values.
Each kernel execution is also a synchronization point between its work-groups, so work-groups can't know other groups' data until the kernel finishes; all data is prepared and becomes visible at the end of kernel execution, so the next kernel can use it immediately.
I haven't tried reading data concurrently from device to host without any synchronization points, but it may work for some devices that don't cache data anywhere. Even integrated GPUs have their own dedicated L3 caches, so it would need at least a barrier command once in a while to let the host read some updated (but possibly partially re-updated, in-flight) data. Event-based synchronization is faster than clFinish and gives correct memory data to the host. A barrier is also faster than clFinish, but it is only usable for device-side sync points.
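For instance, a non-blocking read followed by a wait on its own event gives the host correct data without draining the whole queue with clFinish (a sketch; names hypothetical):

    #include <CL/cl.h>

    void read_with_event(cl_command_queue q, cl_mem buf,
                         float *dst, size_t n)
    {
        cl_event read_done;
        clEnqueueReadBuffer(q, buf, CL_FALSE, 0, n * sizeof(float), dst,
                            0, NULL, &read_done);
        clFlush(q);                      /* issue the read */
        clWaitForEvents(1, &read_done);  /* dst is now up to date */
        clReleaseEvent(read_done);
    }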
If I understand correctly,
Sync point                        Memory visibility
--------------------------------  ----------------------------------
in-kernel fence                   same work-item (and wavefront?)
in-kernel local memory barrier    local memory in same work-group
in-kernel global memory barrier   global memory in same work-group
in-kernel atomics                 only other atomics in same kernel
enqueued kernel/command           next kernel/command in same queue
enqueued barrier                  following commands in same device
enqueued event wait               host
clFinish                          host
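For the in-kernel barrier rows, a toy OpenCL C sketch (hypothetical kernel; the barrier makes local memory consistent within one work-group only):

    __kernel void local_sum(__global const float *in,
                            __global float *out,
                            __local float *tile)
    {
        size_t lid = get_local_id(0);
        tile[lid] = in[get_global_id(0)];

        /* local memory barrier: makes tile[] visible to every work-item
           of the SAME work-group, nothing beyond that */
        barrier(CLK_LOCAL_MEM_FENCE);

        if (lid == 0) {
            float s = 0.0f;
            for (size_t i = 0; i < get_local_size(0); ++i)
                s += tile[i];
            out[get_group_id(0)] = s;
        }
    }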
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/clEnqueueMapBuffer.html
If the buffer object is created with CL_MEM_USE_HOST_PTR set in
mem_flags, the host_ptr specified in clCreateBuffer is guaranteed to
contain the latest bits in the region being mapped when the
clEnqueueMapBuffer command has completed; and the pointer value
returned by clEnqueueMapBuffer will be derived from the host_ptr
specified when the buffer object is created.
and
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/clEnqueueWriteBuffer.html
All commands that use this buffer object or a memory object (buffer or
image) created from this buffer object have finished execution before
the read command begins execution.
So it doesn't say anything about needing a barrier or sync; completion is enough.
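So a blocking map is itself the completion point, e.g. (a sketch assuming buf was created with CL_MEM_USE_HOST_PTR; error checking mostly omitted):

    #include <CL/cl.h>

    void inspect_buffer(cl_command_queue q, cl_mem buf, size_t bytes)
    {
        cl_int err;
        /* blocking map: when this returns, the mapped region is
           guaranteed to contain the latest bits (per the quote above) */
        float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                                               0, bytes, 0, NULL, NULL, &err);
        if (err == CL_SUCCESS && p != NULL) {
            /* ... read p[0 .. bytes/sizeof(float)) on the host ... */
            clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
        }
    }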