2 votes

I've been writing some basic OpenCL programs and running them on a single device. I have been timing the code to get an idea of how well each program performs.

I have been looking at getting my kernels to run on the platform's GPU device and CPU device at the same time. The cl::Context constructor can be passed a std::vector of devices to initialise a context with multiple devices. I have a system with a single GPU and a single CPU.

Is constructing a context with a vector of the available devices all that's needed for kernels to be distributed across multiple devices? I noticed a significant performance increase when I constructed the context with 2 devices, but it seems too simple.

There is a DeviceCommandQueue object; perhaps I should be using that to create a queue for each device explicitly?


2 Answers

1 vote

I did some testing on my system. Indeed you can do something like this:

using namespace cl;
Context context({ devices[0], devices[1] });
CommandQueue queue(context); // queue to push commands to; no device specified, just hand over the context
Program::Sources source;
string kernel_code = get_opencl_code();
source.push_back({ kernel_code.c_str(), kernel_code.length() });
Program program(context, source);
program.build("-cl-fast-relaxed-math -w");

I found that if the two devices are from different platforms (like one Nvidia GPU and one Intel GPU), either clCreateContext throws a read-access-violation error at runtime or program.build fails at runtime. If the two devices are from the same platform, the code compiles and runs, but it won't run on both devices. I tested with an Intel i7-8700K CPU and its integrated Intel UHD 630 GPU, and no matter the order of the devices in the vector the context is created with, the code is always executed on the CPU in this case. I checked with the Windows Task Manager and also with kernel execution time measurements (execution times are characteristic for each device).

You could also monitor device usage with a tool like Task Manager to see which device is actually doing the work. Let me know if it is any different on your system than what I observed.

Generally, parallelization across multiple devices is not done by handing the context a vector of devices. Instead, you give each device a dedicated context and queue and explicitly control which kernels are executed on which queue. This gives you full control over memory transfers and execution order / synchronization points.
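A minimal sketch of that pattern, using the OpenCL C++ wrapper: one context, queue, and program per device, with work enqueued on each queue explicitly. This is untested pseudocode-style scaffolding (it needs an OpenCL runtime and actual devices to run); the function name and the idea of splitting the work range in half are illustrative assumptions, not part of any particular API.

```cpp
#include <CL/cl.hpp>
#include <string>
#include <vector>

// Sketch: give each device its own context and queue, so you decide
// explicitly which kernel runs where. Assumes `devices` was filled
// beforehand via cl::Platform::get / getDevices, and `kernel_code`
// holds the OpenCL C source.
void run_on_each_device(const std::vector<cl::Device>& devices,
                        const std::string& kernel_code) {
    std::vector<cl::Context> contexts;
    std::vector<cl::CommandQueue> queues;
    std::vector<cl::Program> programs;

    for (const cl::Device& dev : devices) {
        cl::Context ctx(dev);           // dedicated context for this device
        cl::CommandQueue q(ctx, dev);   // queue bound to this device
        cl::Program prog(ctx, kernel_code);
        prog.build("-cl-fast-relaxed-math -w"); // built per device
        contexts.push_back(ctx);
        queues.push_back(q);
        programs.push_back(prog);
    }

    // Now enqueue work explicitly per queue, e.g. half the NDRange on
    // each device (illustrative; real splits depend on the workload):
    // cl::Kernel k0(programs[0], "my_kernel");
    // queues[0].enqueueNDRangeKernel(k0, cl::NullRange, cl::NDRange(n / 2));
    // cl::Kernel k1(programs[1], "my_kernel");
    // queues[1].enqueueNDRangeKernel(k1, cl::NDRange(n / 2), cl::NDRange(n / 2));

    for (cl::CommandQueue& q : queues) q.finish(); // synchronization point
}
```

Because each device has its own context, buffers must be created (and, if shared, copied) per context; that extra bookkeeping is the price of the explicit control over scheduling and transfers.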

0 votes

I was seeing a performance increase when passing in the vector of devices. I downloaded a CPU/GPU profiler to actually check the activity of my GPU and CPU while running the code, and I was seeing activity on both devices. The CPU was registering around 95-100% activity and the GPU was getting up to 30-40%, so OpenCL must be splitting the kernels between the 2 devices. My computer has a CPU with an integrated GPU, which may play a role in why the kernels are being shared across the devices: it's not a CPU and a completely separate GPU, they're part of the same component.