The Intel Xeon Phi OpenCL optimization guide suggests using mapped buffers for data transfer between host and device memory. The OpenCL spec also states that this technique is faster than writing data explicitly to device memory. I am trying to measure the data transfer time from host to device and from device to host.
My understanding is that the OpenCL framework supports two ways of transferring data.
Here is my summarized scenario:
a. Explicit Method:
- Writing: clEnqueueWriteBuffer(...)
{ - Invoke execution on device: clEnqueueNDRangeKernel(kernel) }
- Reading: clEnqueueReadBuffer(...)
Pretty simple.
b. Implicit Method:
- Writing: clCreateBuffer(hostPtr, flag, ...) // Use flag CL_MEM_USE_HOST_PTR. Make sure to create an aligned host buffer to map to.
{ - Invoke execution on device: clEnqueueNDRangeKernel(kernel) }
- Reading: clEnqueueMapBuffer(hostPtr, ...) // Device relinquishes access to the mapped memory back to the host for reading the processed data
Not very straightforward.
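For the "aligned host buffer" step in method (b), a minimal sketch of the allocation could look like the following. The 4096-byte alignment and the padding of the size to a multiple of it are assumptions for illustration (check the optimization guide for the values your device expects), and `alloc_host_buffer` and `round_up` are hypothetical helper names, not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed alignment (one 4 KiB page); not taken from the OpenCL spec. */
#define HOST_BUF_ALIGNMENT ((size_t)4096)

/* Round n up to the next multiple of `multiple`. */
static size_t round_up(size_t n, size_t multiple) {
    assert(multiple != 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/*
 * Allocate a host buffer suitable for passing as hostPtr to
 * clCreateBuffer with CL_MEM_USE_HOST_PTR. The padded size is
 * written to *padded_bytes (if non-NULL). Release with free().
 */
static void *alloc_host_buffer(size_t bytes, size_t *padded_bytes) {
    if (bytes == 0) {
        if (padded_bytes) *padded_bytes = 0;
        return NULL;
    }
    size_t padded = round_up(bytes, HOST_BUF_ALIGNMENT);
    if (padded_bytes) *padded_bytes = padded;
    /* C11 aligned_alloc: size is kept a multiple of the alignment. */
    return aligned_alloc(HOST_BUF_ALIGNMENT, padded);
}
```

The returned pointer and the padded size would then be the hostPtr and size arguments when creating the buffer.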
I am using the second method. At what point does the data transfer begin, for both writing and reading? I need to insert timing code at the right places in my code to see how long each transfer takes. So far, I have it inserted before clEnqueueNDRangeKernel(kernel) for writing, and before clEnqueueMapBuffer(hostPtr, ...) for reading. The times I measure are very small, and I doubt that those are the points where data transmission from host to device memory (for this implicit method) actually begins.
Any clarification on profiling the data transfer with these three API commands would be greatly appreciated.
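One alternative to host-side timers is OpenCL event profiling: if the command queue is created with CL_QUEUE_PROFILING_ENABLE, clGetEventProfilingInfo reports CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END for each enqueued command as nanosecond counters. The sketch below only shows the arithmetic on two such timestamps; `elapsed_ms` and `bandwidth_gb_per_s` are hypothetical helper names, and it does not by itself settle where the implicit transfer actually happens.

```c
#include <assert.h>
#include <stdint.h>

/*
 * start_ns / end_ns stand in for the values read with
 * clGetEventProfilingInfo (CL_PROFILING_COMMAND_START / _END),
 * which are device time counters in nanoseconds.
 */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns) {
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes): bytes per nanosecond. */
static double bandwidth_gb_per_s(uint64_t bytes,
                                 uint64_t start_ns, uint64_t end_ns) {
    assert(end_ns >= start_ns);
    uint64_t ns = end_ns - start_ns;
    if (ns == 0) {
        return 0.0; /* interval too short to measure */
    }
    return (double)bytes / (double)ns;
}
```

Applying these to the events returned by the kernel launch and by the map command separately would show which of the two the time is attributed to.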
Thanks, Dave