3
votes

What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution that works reasonably with both discrete and integrated graphics?

As a super-simplified example, consider the following C++ (host) code:

std::vector<float> generate_stuff(size_t num_elements) {
    std::vector<float> result(num_elements);
    for(size_t i = 0; i < num_elements; ++i)
        result[i] = static_cast<float>(i);
    return result;
}

This can be implemented using an OpenCL kernel:

__kernel void gen_stuff(__global float *result) {
    result[get_global_id(0)] = get_global_id(0);
}

The most straightforward solution is to allocate an array on both the device and the host, then copy the data over after the kernel has finished:

std::vector<float> generate_stuff(size_t num_elements) {
    //global context/kernel/queue objects set up appropriately
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY, num_elements*sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
    std::vector<float> result(num_elements);
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0, num_elements*sizeof(float), result.data(), 0, nullptr, nullptr);
    clReleaseMemObject(result_dev);
    return result;
}

This works reasonably with discrete cards. But with shared-memory graphics, it means allocating the memory twice and doing an extra copy. How can one avoid this? One thing is for sure: one should drop clEnqueueReadBuffer and use clEnqueueMapBuffer/clEnqueueUnmapMemObject instead.
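For illustration, a minimal sketch of that map/unmap variant, reusing the same (assumed) global context/kernel/queue objects as above; error checking is omitted. Note that returning a std::vector still implies one host-side copy out of the mapped region, which is what the options below try to avoid.

std::vector<float> generate_stuff(size_t num_elements) {
    // global context/kernel/queue objects set up appropriately
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                       num_elements*sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);

    // Blocking map instead of a read; on integrated GPUs this avoids the device-to-host transfer.
    float *mapped = static_cast<float*>(clEnqueueMapBuffer(
        queue, result_dev, CL_TRUE, CL_MAP_READ, 0,
        num_elements*sizeof(float), 0, nullptr, nullptr, nullptr));

    std::vector<float> result(mapped, mapped + num_elements); // one host-side copy remains

    clEnqueueUnmapMemObject(queue, result_dev, mapped, 0, nullptr, nullptr);
    clReleaseMemObject(result_dev);
    return result;
}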

Some alternative scenarios:

  1. Deal with the extra memory copy. Acceptable if memory bandwidth is not an issue.
  2. Allocate a normal memory array on the host and use CL_MEM_USE_HOST_PTR when creating the buffer. The allocation should use device-specific alignment (it is 4 KB with Intel HD Graphics: https://software.intel.com/en-us/node/531272); I am not aware whether this can be queried from the OpenCL environment. The results should be mapped (with CL_MAP_READ) after the kernel finishes to flush caches. But when is it safe to unmap? Immediately after the map completes (that does not seem to work with discrete AMD graphics)? Deallocating the array also requires modifying client code on Windows (because _aligned_free is different from free). A rough sketch of this option is given after the list.
  3. Allocate using CL_MEM_ALLOC_HOST_PTR and map after the kernel finishes. The cl_mem object has to be kept alive for as long as the buffer is used (and probably while it is mapped?), so it requires polluting client code. This also keeps the array in pinned memory, which might be undesirable.
  4. Allocate on the device without any CL_MEM_*_HOST_PTR flag and map it after the kernel finishes. This is the same as option 2 from deallocation's perspective; it just avoids pinned memory. (Actually, I am not sure whether mapped memory isn't pinned anyway.)
  5. ???
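To make option 2 more concrete, here is a rough, untested sketch. The 4096-byte alignment is taken from the Intel guide linked above rather than queried from the runtime; num_elements and the global context/kernel/queue objects are assumed as before, and error checking is omitted.

#ifdef _WIN32
#include <malloc.h>   // _aligned_malloc / _aligned_free
#else
#include <cstdlib>    // posix_memalign / free
#endif

size_t bytes = num_elements*sizeof(float);
float *host_ptr = nullptr;
#ifdef _WIN32
host_ptr = static_cast<float*>(_aligned_malloc(bytes, 4096));       // release with _aligned_free
#else
posix_memalign(reinterpret_cast<void**>(&host_ptr), 4096, bytes);   // release with free
#endif

cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                                   bytes, host_ptr, nullptr);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);

// Map with CL_MAP_READ so the results become visible on the host (flushes caches).
void *p = clEnqueueMapBuffer(queue, result_dev, CL_TRUE, CL_MAP_READ, 0, bytes,
                             0, nullptr, nullptr, nullptr);
// ... read the results through p (or host_ptr, which holds the data once the map completes) ...
clEnqueueUnmapMemObject(queue, result_dev, p, 0, nullptr, nullptr);
clReleaseMemObject(result_dev);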

How are you dealing with this problem? Is there any vendor-specific solution?

2
A well-posed question. In general, your intention seems to be related to the term "zero copy" (see stackoverflow.com/q/12766578/3182664 or stackoverflow.com/a/19974674/3182664, and developer.amd.com/tools-and-sdks/opencl-zone/…), but the heterogeneity makes it hard to grasp. I'm also looking forward to seeing a good answer here. – Marco13

2 Answers

2
votes

You can do it with a single buffer, for both discrete and integrated hardware:

  1. Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOC_HOST_PTR or vendor-specific (e.g., AMD) flags if it helps performance on certain platforms (read the vendor guidance and do benchmarking).
  2. Enqueue your kernel that writes to the buffer.
  3. clEnqueueMapBuffer with CL_MAP_READ and blocking set to true. On discrete hardware this will copy over PCIe; on integrated hardware it's "free".
  4. Use the results on the CPU using the returned pointer.
  5. clEnqueueUnmapMemObject.
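A minimal sketch of these steps, assuming the same global context/kernel/queue objects as in the question and omitting error checking:

size_t bytes = num_elements*sizeof(float);

// 1. Allocate; CL_MEM_ALLOC_HOST_PTR is optional and may help on some platforms.
cl_mem buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            bytes, nullptr, nullptr);

// 2. Run the kernel that writes into the buffer.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);

// 3. Blocking map for reading: a PCIe copy on discrete GPUs, essentially free on integrated ones.
float *out = static_cast<float*>(clEnqueueMapBuffer(
    queue, buf, CL_TRUE, CL_MAP_READ, 0, bytes, 0, nullptr, nullptr, nullptr));

// 4. Use the results through 'out' on the CPU.

// 5. Unmap when done (and release the buffer when it is no longer needed).
clEnqueueUnmapMemObject(queue, buf, out, 0, nullptr, nullptr);
clReleaseMemObject(buf);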
1
votes

Depends on the use case:

  1. For minimal memory footprint and I/O efficiency (Dithermaster's answer):
    • Create the buffer with the CL_MEM_WRITE_ONLY flag, or maybe CL_MEM_ALLOC_HOST_PTR (depending on the platform). Do a blocking map for reading, use the data, then unmap it. This option requires that the data handler (consumer) knows about the CL existence and unmaps the buffer using CL calls.
  2. For situations where you have to provide the buffer data to a third party (i.e., libraries that need a plain C pointer or buffer class, agnostic to CL):
    • In this case it may not be good to use mapped memory. Access to mapped memory is typically slower than access to normal CPU memory. So, instead of mapping, then memcpy(), then unmapping, it is easier to perform a clEnqueueReadBuffer() directly into the CPU address where the output should be copied (see the sketch after this list). With some vendors this does not use pinned memory and the copy is slow, so it is better to fall back to option 1. But in some other cases where there is no pinned memory, I have found it faster.
  3. Any other condition for reading the kernel output? I think not...
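A small sketch of the second case, reading straight into a pointer owned by the third party. The function name read_into and the destination dst are hypothetical; the global context/kernel/queue objects and error handling are assumed as in the question.

// Read kernel output directly into an externally owned buffer; no CL types leak to the caller.
void read_into(float *dst, size_t num_elements) {
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                       num_elements*sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
    // Blocking read straight to the destination; no map/unmap needed.
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0, num_elements*sizeof(float),
                        dst, 0, nullptr, nullptr);
    clReleaseMemObject(result_dev);
}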