1
votes

I am using the beta OpenCL 2.0 support on NVIDIA and targeting high-end GPUs such as the GTX 1080 Ti. In my compute pipeline, I sometimes need to dispatch work that processes relatively small images independently of one another. In theory, these images should be able to be processed in parallel on a single GPU, because the number of work-groups for a single image won't saturate all of the GPU's compute units.

  1. Is this possible in OpenCL? Does this have a name in OpenCL?

  2. If it is possible, is using multiple queues for a single device the only way to do this? Or will the driver look at the "waitEventList" and decide which kernels can be processed in parallel?

  3. Do I need CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE?


1 Answer

3
votes

1- Yes, this is one way to keep the compute units well occupied. A general name for it is "pipelining" (with the help of asynchronous enqueueing and/or dynamic parallelism). There are different ways to do it. One is to use three queues coordinated with wait events: reads on one queue, writes on another, and compute on a third. A second is to have M queues, each doing the read-compute-write work for a different image, without events.

2- You can even use a single queue, as long as it is an out-of-order queue, so that kernels are dispatched independently. At least on some AMD cards, even an in-order queue can run independent kernels concurrently (according to AMD's CodeXL), although this may be outside the OpenCL spec. Wait events can act as a constraint that stops this kind of driver-side optimization (again, at least on AMD).

From 2.x onwards there is also device-side enqueueing: you enqueue one kernel from the host, and that kernel can enqueue N kernels without host intervention (if all the data is already uploaded to the card). This may not hide latency as well as multiple host-side queues do if data still has to be transferred from host to device.

3- Vendors are not required to implement out-of-order execution, so this may not work.