1
votes

I'm learning OpenCL and attempt to utilize it on some low-latency scenario, so I'm really concerned with the memory transferring delay.

According to NVidia's OpenCL Best Practices Guide, and also by many other places, direct read/write on buffer object should be avoided. Instead, we should use map/unmap utility. In that guide, a demonstrative code is given like this:

cl_mem cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL);
cl_mem cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL);
unsigned char* cDataIn = (unsigned char*) clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);

for(unsigned int i = 0; i < memSize; i++)
{
    cDataIn[i] = (unsigned char)(i & 0xff);
}

clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, szBuffBytes, cDataIn , 0, NULL, NULL);

In this code snippet, two buffer objects are generated explicitly, and a write-to-device operation is also explicitly called.

If my understanding is correct, when you call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR OR CL_MEM_USE_HOST_PTR, the storage of buffer object is created in on host side, probably in DMA memory, and no storage is allocated on device side. So the above code actually creates two separated storage. If so:

  • What would happen if I call map buffer on cmDevBufIn, which do not have host side memory?

  • For CPU-integrated GPUs, there is no separate graphics memory. Especially, for new version of AMD APUs, the memory address is also homologus. So it seems create two buffer objects is not good. What is the best practice for integrated platforms?

  • Is there any way to write single lines of memory transfer code for different platforms? Or I must write several different suits of memory transfer codes to achieve best performance for Nvidia, AMD separate GPU, AMD old APU, AMD new APU and Intel HD graphics......

1

1 Answers

3
votes

Unfortunately, it's different for each vendor.

NVIDIA claims their best bandwidth is when you use read/write buffer where the host memory is "pinned", which can be achieved by creating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it (I think your example is that). You should also compare that to just mapping and unmapping the device memory; their more recent drivers have gotten better at that.

With AMD you can just map/unmap the device buffer to get full speed. They also have a bunch of vendor-specific buffer flags which can make certain scenarios faster; you should study them but more importantly create benchmarks that try out everything to see what actually works best for your task.

With both discrete devices you should use separate command queues for the transfer operations so they can overlap with other (non-dependent) compute operations (look up various compute overlap examples). Furthermore, some higher end discrete GPUs can be downloading one buffer at the same time they are uploading another (using dual DMA engines), so you could be uploading one batch of work while you're computing another while you're downloading the result of a third. When written elegantly, this isn't even much more code than the strictly sequential version, but you have to use OpenCL events to synchronize between command queues. NVIDIA has a GTC talk you can watch that shows how to do this for video frames every 16 ms.

With AMD's APU and with Intel's Integrated Graphics, the map/unmap of the "device" buffer is "free" since it is in main memory. Don't use read/write buffer here or you'll be paying for unneeded transfers.