Is there device side pointer of host memory for kernel use in OpenCL (like CUDA)?

Question

In CUDA, we can achieve kernel managed data transfer from host memory to device shared memory by device side pointer of host memory. Like this:

int  *a,*b,*c;          // host pointers
int *dev_a, *dev_b, *dev_c;     // device pointers to host memory

    …       

cudaHostGetDevicePointer(&dev_a, a, 0); // mem. copy to device not need now, but ptrs needed instead
cudaHostGetDevicePointer(&dev_b, b, 0);
cudaHostGetDevicePointer(&dev_c ,c, 0);

    …   

//kernel launch
add<<<B,T>>>(dev_a,dev_b,dev_c); 
// dev_a, dev_b, dev_c are passed into kernel for kernel accessing host memory directly.

In the above example, kernel code can access host memory via dev_a, dev_b and dev_c. Kernel can utilize these pointers to move data from host to shared memory directly without relaying them by global memory.

But seems that it is an mission impossible in OpenCL? (local memory in OpenCL is the counterpart of shared memory in CUDA)

It is impossible to have the same functionality. However, it is not a core feature. I can't think of a case where you are really forced to do this kind of operation. Isn't it possible to just call clEnqueueWriteBuffer() before calling the kernel? It the device local cache a limitting factor for you? — DarkZeros
It is a good manner for overlapping transfer and computation. Explicit copy operation issued from host side is avoided, which can only copy data from host to device global memory. By device side pointer, data can be transfered between host and shared memory directly. It allows device schedule computation and data transfer as its need, which implies data transfer may be hidden. Traditional way is multi streams (CUDA) and multi command queues(OpenCL). The traditional way needs explicit scheduling in host side, which makes the overall code a little bit ugly/hairy. — jxj

sharpneli sharpneli · Accepted Answer · 2013-11-05T09:07:05

You can find exactly identical API in OpenCL.

How it works on CUDA:

According to this presentation and the official documentation.

The money quote about cudaHostGetDevicePointer :

Passes back device pointer of mapped host memory allocated by cudaHostAlloc or registered by cudaHostRegister.

CUDA cudaHostAlloc with cudaHostGetDevicePointer works exactly like CL_MEM_ALLOC_HOST_PTR with MapBuffer works in OpenCL. Basically if it's a discrete GPU the results are cached in the device and if it's a discrete GPU with shared memory with the host it will use the memory directly. So there is no actual 'zero copy' operation with discrete GPU in CUDA.

The function cudaHostGetDevicePointer does not take raw malloced pointers in, just like what is the limitation in OpenCL. From the API users point of view those two are exactly identical approaches allowing the implementation to do pretty much identical optimizations.

With discrete GPU the pointer you get points to an area where the GPU can directly transfer stuff in via DMA. Otherwise the driver would take your pointer, copy the data to the DMA area and then initiate the transfer.

However in OpenCL2.0 that is explicitly possible, depending on the capabilities of your devices. With the finest granularity sharing you can use randomly malloced host pointers and even use atomics with the host, so you could even dynamically control the kernel from the host while it is running.

http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf

See page 162 for the shared virtual memory spec. Do note that when you write kernels even these are still just __global pointers from the kernel point of view.

Is there device side pointer of host memory for kernel use in OpenCL (like CUDA)?

1 Answers