What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution what works reasonably with both discrete and integrated graphics?
As a super-simplified example, consider the following C++ (host) code:
std::vector<float> generate_stuff(size_t num_elements) {
std::vector<float> result(num_elements);
for(int i = 0; i < num_elements; ++i)
result[i] = i;
return result;
}
This can be implemented using an OpenCL kernel:
__kernel void gen_stuff(float *result) {
result[get_global_id(0)] = get_global_id(0);
}
The most straightforward solution is to allocate an array on both the device and host, then copy after kernel finished:
std::vector<float> generate_stuff(size_t num_elements) {
//global context/kernel/queue objects set up appropriately
cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY, num_elements*sizeof(float) );
clSetKernelArg(kernel, 0, sizeof(cl_mem), result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
std::vector<float> result(num_elements);
clEnqueueReadBuffer( queue, result_dev, CL_TRUE, 0, num_elements*sizeof(float), result_host.data(), 0, nullptr, nullptr );
return result;
}
This works reasonably with discrete cards. But with shared memory graphics, this means allocating double and an extra copy. How can one avoid this? One thing for sure, one should drop clEnqueuReadBuffer
and use clEnqueueMapBuffer
/clUnmapMemObject
instead.
Some alternative scenarios:
- Deal with an extra memory copy. Acceptable if memory bandwidth is not an issue.
- Allocate a normal memory array on host, use
CL_MEM_USE_HOST_PTR
when creating the buffer. Should allocate with device-specific alignment - it is 4k with Intel HD Graphics: https://software.intel.com/en-us/node/531272 I am not aware if this is possible to query from the OpenCL environment. Results should be mapped (withCL_MAP_READ
) after kernel finishes to flush caches. But when is it possible to unmap? Immediately after mapping is finished (it seems that does not work with AMD discrete graphics)? Deallocation of the array also requires modification of client code on Windows (due to _aligned_free being different from free). - Allocate using
CL_MEM_ALLOCATE_HOST_PTR
and map after kernel finishes. The cl_mem object has to be kept alive till the buffer is used (and probably even mapped?), so it requires polluting client code. Also this keeps the array in a pinned memory, what might be undesirable. - Allocate on device without
CL_MEM_*_HOST_PTR
, and map it after kernel finishes. This is the same thing as option 2 from deallocation's perspective, it's just avoiding pinned memory. (Actually, not sure if memory that is mapped isn't pinned.) - ???
How are you dealing with this problem? Is there any vendor-specific solution?