I have an application that I designed to run on AMD GPUs with OpenCL. I finally got the app running and bug-free (haha) yesterday, targeted at a single GPU. Now that the app works, it's time to scale it to multiple GPUs.
I've read a lot about how to set this up. We're using the single-context, multiple-queue method.
I pull the list of devices, choose 2 of the GPUs, and create a context containing both, then a single clBuildProgram call covering both devices as well. Then I create two separate queues.
Pseudocode of the original, working app, now converted to handle 2 GPUs:
context = clCreateContext(NULL, 2, device_list, NULL, NULL, &ret);

for (x = 0; x < 2; x++)
    queue[x] = clCreateCommandQueue(context, device_list[x], CL_QUEUE_PROFILING_ENABLE, &ret);

clBuildProgram(program, 2, device_list, options, NULL, NULL);

create kernels...
run...

for (outer_loop = 0; outer_loop < 10; outer_loop++) {
    clEnqueueNDRangeKernel(queue[0], kernel_init, ..., offset, ..., &event[0]);
    clEnqueueNDRangeKernel(queue[1], kernel_init, ..., different_offset, ..., &event[1]);

    clFinish(queue[0]);
    clFinish(queue[1]);

    get profiling data and printf results
}
That's basically how the code looks. Arguments are set and buffer writes are done BEFORE the loop - the init kernel does not rely on any input to start working. After it runs, it DOES do an async_work_group_copy of its generated data to a global buffer.
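To be clear about that copy-out step: the end of the kernel looks roughly like this (a sketch - the identifiers and the size 256 are placeholders, not my real code):

```
// kernel-side sketch; names/sizes are placeholders
__kernel void kernel_init(__global float *out)
{
    __local float scratch[256];              // per-work-group results

    // ... generate data into scratch ...

    // copy this group's results out to the global buffer
    event_t e = async_work_group_copy(&out[get_group_id(0) * 256],
                                      scratch, 256, 0);
    wait_group_events(1, &e);                // whole group waits on the copy
}
```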
Now, before I modified the code for 2 GPUs, the kernel ran in 27ms (for each loop iteration).
After I modified the code, if I comment out ONE or the OTHER of the two kernel launches (the clEnqueueNDRangeKernel and its associated clFinish), either one runs in 27ms on its own.
If I run the code on both GPUs in parallel, I get very odd behavior.
On the first iteration of the loop, they both execute in about 37-42ms individually. I'm OK with a slight slowdown there, since I'm getting twice the work done. But after the first iteration, one kernel or the other will randomly have a 4-5 SECOND delay between being queued and starting.
Here's the output of my profiling/timing for it. All numbers are in ms.
Q0: til sub: 8.8542 til start: 9.8594 til fin: 47.3749
Q1: til sub: 0.0132 til start: 13.4089 til fin: 39.2364
Q0: til sub: 0.0072 til start: 0.2310 til fin: 37.1187
Q1: til sub: 0.0122 til start: 4152.4638 til fin: 4727.1146
Q0: til sub: 0.0302 til start: 488.6218 til fin: 5049.7233
Q1: til sub: 0.0179 til start: 5023.9310 til fin: 5049.7762
Q0: til sub: 0.0190 til start: 2.0987 til fin: 39.4356
Q1: til sub: 0.0164 til start: 3996.2654 til fin: 4571.5866
Q0: til sub: 0.0284 til start: 488.5751 til fin: 5046.3555
Q1: til sub: 0.0176 til start: 5020.5919 til fin: 5046.4382
The machine I'm running this on has 5 GPUs in it. Regardless of which two I use, one of the two GPUs (it's not always the same one) gets a 4-5 second delay on starting. Use a single GPU - no delay.
What could be causing this? Any ideas? I'm not blocking - the clFinish is only there to get profiling info. And even if it were blocking, it wouldn't be a 5 second delay.
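For what it's worth, I see no reason the waits themselves should matter - the same wait can be written as a single clWaitForEvents instead of two clFinish calls (pseudocode in the same style as above; the clFlush calls are there because clWaitForEvents is only guaranteed to complete once the commands have been flushed to the devices):

```
clEnqueueNDRangeKernel(queue[0], kernel_init, ..., offset, ..., &event[0]);
clEnqueueNDRangeKernel(queue[1], kernel_init, ..., different_offset, ..., &event[1]);

clFlush(queue[0]);
clFlush(queue[1]);

wait for both kernels with one call
clWaitForEvents(2, event);
```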
Also - I thought maybe the writes to global memory that the kernel does might be part of the delay. I commented the writes out. Nope. No change.
In fact, I added a return; as the first line of the kernel - so it does absolutely nothing. The 40ms dropped to 0.25ms, but the 5 second delay was still there.