0
votes

The Intel Xeon Phi OpenCL optimization guide suggests using mapped buffers for data transfer between host and device memory. The OpenCL spec also states that this technique is faster than writing data explicitly to device memory. I am trying to measure the data transfer time from host to device and from device to host.

My understanding is that the OpenCL framework supports two ways of transferring data.

Here is my summarized scenario:

a. Explicit Method:

- Writing: clEnqueueWriteBuffer(...)

{ - Invoke execution on device: clEnqueueNDRangeKernel(kernel)  }

- Reading: clEnqueueReadBuffer(...)

Pretty simple.

b. Implicit Method:

- Writing: clCreateBuffer(hostPtr, flag, ...)       // Use the flag CL_MEM_USE_HOST_PTR. Make sure the host buffer to map to is aligned.

{ - Invoke execution on device: clEnqueueNDRangeKernel(kernel)  }

- Reading: clEnqueueMapBuffer(hostPtr, ...)          // the device relinquishes access to the mapped memory so the host can read the processed data

Not very straightforward.

I am using the second method. At what point does the data transfer begin, for both writing and reading? I need to insert timing code in the right place to see how long each transfer takes. So far, I have inserted it before clEnqueueNDRangeKernel(kernel) for writing and before clEnqueueMapBuffer(hostPtr, ...) for reading. The times I measure are very small, and I doubt that those are the points where the transfer between host and device memory (for this implicit method) actually begins.

Any clarification on profiling the data transfer with these three API commands would be greatly appreciated.

Thanks, Dave

1
It is indeed a grey zone of the spec. Nobody details what is going on in the background, and there is no event to keep track of it. You can try using a marker (clEnqueueMarker) to measure exactly the time taken by each task in the queue. – DarkZeros

1 Answer

1
votes

You need to use the manufacturer-supplied tools (I think VTune Amplifier does the job on Intel hardware) to see what actually happens on the device, as the OpenCL spec intentionally allows the implementation leeway on when to actually perform things.

So I can only tell you the points at which the device is permitted to do work and the point at which it is actually forced to do it.

Right after you call

clCreateBuffer(hostPtr, flag, ...)

the device is allowed to begin reading the data. It can do this while your program runs normally, as you are not permitted to write there until you call clEnqueueMapBuffer. It's extremely likely that your call to clEnqueueNDRangeKernel comes before the transfer is complete, so the kernel just waits in the command queue.

After all these lines the device is only permitted to work; nothing has yet forced it to, so in some cases it might not have done anything yet. Then comes the call that forces it to evaluate everything and wait for the earlier commands to finish, assuming you set it as a blocking call.

clEnqueueMapBuffer(hostPtr, ...)

If you run this call with blocking_map set to CL_TRUE, the data is ready as soon as the call returns. The implementation makes you wait inside that call until the data has reached the device, been processed by the kernel and been written back.

If you don't run this as a blocking map, the data is not necessarily back yet. You have then just issued three non-blocking calls and the device can do whatever it wishes.

tl;dr: Everything from the write through execution to the read can happen inside the blocking clEnqueueMapBuffer call.
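
If you want numbers without vendor tools, one option is OpenCL event profiling: create the queue with CL_QUEUE_PROFILING_ENABLE, pass an event to clEnqueueNDRangeKernel and clEnqueueMapBuffer, and query each event with clGetEventProfilingInfo. The sketch below only shows the arithmetic on the four nanosecond counters you get back. It uses uint64_t instead of cl_ulong so that it compiles without the OpenCL headers, and the names cmd_times, wait_ms, exec_ms and print_breakdown are made up for illustration. Where the implicit copy for a CL_MEM_USE_HOST_PTR buffer shows up in these intervals, if at all, is up to the implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* The four timestamps OpenCL records for a profiled command, in nanoseconds.
   In a real program each field would be filled by clGetEventProfilingInfo()
   with the matching CL_PROFILING_COMMAND_* constant. cl_ulong is a 64-bit
   unsigned integer, so uint64_t stands in for it here (assumption made to
   keep the sketch standalone). */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} cmd_times;          /* hypothetical helper type */

/* Convert a nanosecond interval to milliseconds. */
static double ns_to_ms(uint64_t from, uint64_t to) {
    return (double)(to - from) / 1.0e6;
}

/* Time between enqueueing the command and the device starting on it. */
double wait_ms(const cmd_times *t) { return ns_to_ms(t->queued, t->start); }

/* Time the device spent executing the command itself. */
double exec_ms(const cmd_times *t) { return ns_to_ms(t->start, t->end); }

/* Print one line per command, e.g. for the kernel and for the map. */
void print_breakdown(const char *label, const cmd_times *t) {
    printf("%-8s waited %.3f ms, ran %.3f ms\n", label, wait_ms(t), exec_ms(t));
}
```

Comparing the map event's interval with host wall-clock time around the blocking clEnqueueMapBuffer call should show how much of the wait is earlier queued work finishing rather than the map itself, which would explain why timing only the enqueue calls gives tiny numbers.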