What about clEnqueueWriteBuffer (non-blocking) and
clEnqueueNDRangeKernel enqueued after, which uses that buffer
contents?
Since it is an in-order queue, it will do the write first and run the kernel only after the write finishes, even though the write call itself is non-blocking.
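A minimal sketch of that pattern (queue, buffer and kernel creation assumed done elsewhere; names hypothetical, error checking omitted):

    #include <CL/cl.h>

    /* In-order queue: the non-blocking write completes before the
       kernel enqueued after it starts executing. */
    void write_then_run(cl_command_queue queue, cl_mem buf,
                        const float *host_data, size_t n, cl_kernel kernel)
    {
        size_t global = n;

        /* CL_FALSE = non-blocking: the call returns immediately, but the
           command still runs in submission order on an in-order queue.
           host_data must stay valid until the write actually completes. */
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                             host_data, 0, NULL, NULL);

        /* No event wait list needed: the in-order queue guarantees the
           kernel starts only after the write has finished. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);
    }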
clEnqueueBarrier is a device-side synchronization command and is intended for out-of-order queues. When you use clFinish(), you pay extra for the communication between host and device. Enqueueing a barrier is a much faster synchronization, but it works on the device side only. When you need to synchronize a queue with another queue and still need a similar sync point, use clEnqueueWaitForEvents just after (or before) the barrier, or simply use only the event waiting (for an in-order queue).
In OpenCL 1.2, clEnqueueWaitForEvents and clEnqueueBarrier were merged into clEnqueueBarrierWithWaitList, which lets you both place a barrier in an out-of-order queue and synchronize it with other queues or even host-side user events.
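A sketch of that 1.2-style pattern (queueA and queueB assumed to share one context; hypothetical names, error checking omitted):

    #include <CL/cl.h>

    /* kernelB in queueB must not run before kernelA's result is ready. */
    void barrier_with_wait_list_example(cl_command_queue queueA,
                                        cl_command_queue queueB,
                                        cl_kernel kernelA, cl_kernel kernelB,
                                        size_t n)
    {
        cl_event done_in_a;
        clEnqueueNDRangeKernel(queueA, kernelA, 1, NULL, &n, NULL,
                               0, NULL, &done_in_a);
        clFlush(queueA);            /* make sure queueA's work is issued */

        /* Device-side sync point: everything enqueued to queueB after
           this barrier waits for earlier queueB commands AND done_in_a. */
        clEnqueueBarrierWithWaitList(queueB, 1, &done_in_a, NULL);
        clEnqueueNDRangeKernel(queueB, kernelB, 1, NULL, &n, NULL,
                               0, NULL, NULL);
        clReleaseEvent(done_in_a);
    }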
If there is only a single in-order queue, you don't need a barrier, and when you need to synchronize with the host, you can use clFinish or an event-based synchronization command.
Or do I get memory synchronization out-of-the-box for all the commands in an in-order queue?
For enqueue-type commands only, yes. Enqueue (1 write + 1 compute + 1 read) 128 times in an in-order queue and they will all run one after another, completing a 128-step simulation (once they are issued by a flush/finish command). The commands don't have to follow a specific pattern for this implicit synchronization; anything like 1 write + 2 reads + 2 kernels + 5 writes + 1 read + 1 kernel + 15 reads runs one after another (2 kernels = 1 kernel + 1 kernel).
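The 128-step case could look like this (a sketch; the step kernel and host_in/host_out names are made up for illustration):

    #include <CL/cl.h>

    /* q is an in-order queue; error checking omitted. */
    void simulate_128_steps(cl_command_queue q, cl_mem buf, cl_kernel step,
                            float *host_in, float *host_out, size_t n)
    {
        size_t bytes = n * sizeof(float);
        for (int i = 0; i < 128; ++i) {
            /* write + compute + read: no events, no barriers; the
               in-order queue runs each command only after the previous
               one completes */
            clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes, host_in,
                                 0, NULL, NULL);
            clEnqueueNDRangeKernel(q, step, 1, NULL, &n, NULL,
                                   0, NULL, NULL);
            clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, host_out,
                                0, NULL, NULL);
        }
        clFinish(q);  /* issue everything and wait; host_out is valid now */
    }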
For non-enqueue-type commands such as clSetKernelArg, you have to use a synchronization point, or do them before enqueuing any commands.
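For example, a safe ordering might look like this (hypothetical names; error checking omitted):

    #include <CL/cl.h>

    void launch_with_args(cl_command_queue q, cl_kernel k,
                          cl_mem in, cl_mem out, size_t n)
    {
        /* clSetKernelArg is a host-side call, not an enqueued command;
           it has no place in the queue's ordering, so do it before the
           enqueue that should see these values. */
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    }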
You can also use enqueued commands themselves as inter-queue sync points: pass events through a command's event_wait_list parameter, and use the last (event) parameter to get its completion event to be waited on in another queue (signaling). But this is still not a barrier for an out-of-order queue.
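A sketch of that signaling pattern, with a write's completion event gating a single kernel in another queue (hypothetical names):

    #include <CL/cl.h>

    void signal_across_queues(cl_command_queue writer_q,
                              cl_command_queue compute_q,
                              cl_mem buf, const float *src, size_t n,
                              cl_kernel k)
    {
        cl_event write_done;
        clEnqueueWriteBuffer(writer_q, buf, CL_FALSE, 0, n * sizeof(float),
                             src, 0, NULL, &write_done);
        clFlush(writer_q);  /* issue the write so the event can complete */

        /* Only this kernel waits on the event; if compute_q is
           out-of-order, its other commands can still reorder freely,
           so this is signaling, not a barrier. */
        clEnqueueNDRangeKernel(compute_q, k, 1, NULL, &n, NULL,
                               1, &write_done, NULL);
        clReleaseEvent(write_done);
    }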
If a buffer is used by two kernels in different queues and both write to that buffer, there must be synchronization between the queues, unless they write to different locations. So you can have 20 kernels each working on 1/20th of a buffer, run them in parallel using multiple queues, and synchronize all the queues only at the end using a wait list. If a kernel uses or alters another kernel's region concurrently, it is undefined behaviour. A similar approach works for map/unmap too.
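A sketch of that partitioning idea using global work offsets (the queues are assumed to share one context and device; chunk size assumed to divide evenly):

    #include <CL/cl.h>

    #define NQUEUES 20

    /* Each queue's kernel covers its own 1/20th of the index space,
       so the written regions never overlap. */
    void partitioned_run(cl_command_queue queues[NQUEUES], cl_kernel k,
                         size_t total_items)
    {
        cl_event done[NQUEUES];
        size_t chunk = total_items / NQUEUES;

        for (int i = 0; i < NQUEUES; ++i) {
            size_t offset = (size_t)i * chunk;
            /* no inter-queue sync while running: disjoint regions */
            clEnqueueNDRangeKernel(queues[i], k, 1, &offset, &chunk, NULL,
                                   0, NULL, &done[i]);
            clFlush(queues[i]);
        }

        /* single sync point at the very end, via the collected wait list */
        clWaitForEvents(NQUEUES, done);
        for (int i = 0; i < NQUEUES; ++i)
            clReleaseEvent(done[i]);
    }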
in-order vs out-of-order example:
r: read, w: write, c: compute
                    <------------clFinish----------------------->
in-order queue....: rwrwrwcwccwccwrcwccccccwrcwcwrwcwrwccccwrwcrw
out-of-order queue: <--r--><-r-><-c-><-----c-----><-----r-----><w>
                    <---w-------><-------r-----><------c----->
                    <---r---><-----c--------------------><--c->
                    <---w--->
                    <---c----->            <----w------>
and another out-of-order queue with a barrier in the middle:
<---r---><--w--->   |  <-----w---->
<---c------>        |  <----c--->  <----w--->
<---c-------->      |  <------r------->
<----w------>       |  <----c------->
where the read/write operations before the barrier are forced to wait until all commands reach the same barrier; then all the remaining ones continue concurrently.
The last example shows that memory visibility from the "host side" can be acquired by a barrier or by clFinish. But a barrier doesn't inform the host that it has finished, so you need to query events about the queue. clFinish blocks until all commands are finished, so you don't need to query anything. Both will make the host see the most up-to-date memory.
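A sketch of the event-query route (the event comes from clEnqueueBarrierWithWaitList's last parameter; remember to clFlush the queue so the barrier gets issued):

    #include <CL/cl.h>

    /* Returns nonzero once everything up to the barrier has completed.
       Poll this instead of blocking the host in clFinish. */
    int barrier_completed(cl_event barrier_event)
    {
        cl_int status;
        clGetEventInfo(barrier_event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        return status == CL_COMPLETE;
    }

clWaitForEvents(1, &barrier_event) or clSetEventCallback are the blocking and callback alternatives to polling.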
Your question is about memory visibility for commands of an in-order queue, so you don't need a synchronization point for them to see each other's most up-to-date values.
Each kernel execution is also a synchronization point between its work-groups, so work-groups can't know other groups' data until the kernel finishes; all data is prepared and becomes visible at the end of kernel execution, so the next kernel can use it immediately.
I haven't tried reading data concurrently from device to host without any synchronization points, but it may work for some devices that don't cache data anywhere. Even integrated GPUs have their own dedicated L3 caches, so it would need at least a barrier command once in a while to let the host read some updated (but possibly partially re-updated, in-flight) data. Event-based synchronization is faster than clFinish and gives correct memory data to the host. A barrier is also faster than clFinish, but it is only usable for device-side sync points.
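For instance, a non-blocking read followed by a wait on its own event gives the host correct data without draining the whole queue with clFinish (a sketch; names hypothetical):

    #include <CL/cl.h>

    void read_with_event(cl_command_queue q, cl_mem buf,
                         float *dst, size_t n)
    {
        cl_event read_done;
        clEnqueueReadBuffer(q, buf, CL_FALSE, 0, n * sizeof(float), dst,
                            0, NULL, &read_done);
        clFlush(q);                      /* issue the read */
        clWaitForEvents(1, &read_done);  /* dst is now up to date */
        clReleaseEvent(read_done);
    }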
If I understand correctly,
Sync point                        Memory visibility
--------------------------------  ----------------------------------
in-kernel fence                   same work-item (and wavefront?)
in-kernel local memory barrier    local memory in same work-group
in-kernel global memory barrier   global memory in same work-group
in-kernel atomics                 only other atomics in same kernel
enqueued kernel/command           next kernel/command in same queue
enqueued barrier                  following commands in same device
enqueued event wait               host
clFinish                          host
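For the in-kernel barrier rows, a toy OpenCL C sketch (hypothetical kernel; the barrier makes local memory consistent within one work-group only):

    __kernel void local_sum(__global const float *in,
                            __global float *out,
                            __local float *tile)
    {
        size_t lid = get_local_id(0);
        tile[lid] = in[get_global_id(0)];

        /* local memory barrier: makes tile[] visible to every work-item
           of the SAME work-group, nothing beyond that */
        barrier(CLK_LOCAL_MEM_FENCE);

        if (lid == 0) {
            float s = 0.0f;
            for (size_t i = 0; i < get_local_size(0); ++i)
                s += tile[i];
            out[get_group_id(0)] = s;
        }
    }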
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/clEnqueueMapBuffer.html
If the buffer object is created with CL_MEM_USE_HOST_PTR set in
mem_flags, the host_ptr specified in clCreateBuffer is guaranteed to
contain the latest bits in the region being mapped when the
clEnqueueMapBuffer command has completed; and the pointer value
returned by clEnqueueMapBuffer will be derived from the host_ptr
specified when the buffer object is created.
and
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/clEnqueueWriteBuffer.html
All commands that use this buffer object or a memory object (buffer or
image) created from this buffer object have finished execution before
the read command begins execution.
So it doesn't say anything about needing a barrier or sync; completion is enough.
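So a blocking map is itself the completion point, e.g. (a sketch assuming buf was created with CL_MEM_USE_HOST_PTR; error checking mostly omitted):

    #include <CL/cl.h>

    void inspect_buffer(cl_command_queue q, cl_mem buf, size_t bytes)
    {
        cl_int err;
        /* blocking map: when this returns, the mapped region is
           guaranteed to contain the latest bits (per the quote above) */
        float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                                               0, bytes, 0, NULL, NULL, &err);
        if (err == CL_SUCCESS && p != NULL) {
            /* ... read p[0 .. bytes/sizeof(float)) on the host ... */
            clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
        }
    }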