So I have an application that I would like to implement using OpenCL and distribute across multiple machines using MPI.
At every iteration of the algorithm I need to synchronize the buffers between the MPI processes, but here is the catch: only the borders of the 2D buffers need to be synchronized/copied, not the entire region.
So my question is whether it is possible, with OpenCL's memory mapping mechanism (clEnqueueMapBuffer and clEnqueueUnmapMemObject), to read/write only the borders of a 2D buffer without triggering a copy of the entire buffer.
Basically, this can only work if OpenCL uses DMA instead of a host-side copy of the buffer. So my question really is whether OpenCL supports DMA access to device buffer data on a discrete PCIe GPU. And if so, on which hardware and which operating systems?
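To make the access pattern concrete, here is a rough sketch in plain C (no OpenCL calls) of the sub-ranges I have in mind. It assumes a row-major 2D buffer. The names `byte_range`, `rect_params`, `row_range` and `column_rect` are made up for illustration and are not part of OpenCL. The top and bottom rows are contiguous, so their offset and size could be passed as the `offset` and `size` arguments of clEnqueueMapBuffer. The left and right columns are strided, so they would fit the origin, region and row pitch arguments of clEnqueueReadBufferRect (OpenCL 1.1 and later) better. Whether the driver then actually transfers only those bytes is exactly what I am unsure about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper types for illustration; not part of OpenCL. */

/* One contiguous byte range inside the buffer. */
typedef struct {
    size_t offset; /* byte offset from the start of the buffer */
    size_t size;   /* length in bytes */
} byte_range;

/* A strided rectangle, laid out like the arguments of a rect transfer:
 * origin = (x in bytes, y in rows, z in slices),
 * region = (width in bytes, height in rows, depth in slices). */
typedef struct {
    size_t origin[3];
    size_t region[3];
    size_t row_pitch; /* bytes between the starts of consecutive rows */
} rect_params;

/* Byte range of one full row of a row-major width x height grid. */
static byte_range row_range(size_t width, size_t height,
                            size_t elem_size, size_t row)
{
    byte_range r;
    assert(width > 0 && elem_size > 0 && row < height);
    r.offset = row * width * elem_size;
    r.size = width * elem_size;
    return r;
}

/* Rectangle covering one full column (one element wide, all rows). */
static rect_params column_rect(size_t width, size_t height,
                               size_t elem_size, size_t col)
{
    rect_params p;
    assert(height > 0 && elem_size > 0 && col < width);
    p.origin[0] = col * elem_size; /* x offset in bytes */
    p.origin[1] = 0;               /* start at the first row */
    p.origin[2] = 0;               /* single 2D slice */
    p.region[0] = elem_size;       /* one element wide, in bytes */
    p.region[1] = height;          /* every row */
    p.region[2] = 1;               /* depth of one slice */
    p.row_pitch = width * elem_size;
    return p;
}
```

For a 1024 x 512 buffer of floats, the bottom row would be `row_range(1024, 512, sizeof(float), 511)` and the right column `column_rect(1024, 512, sizeof(float), 1023)`.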