
I've created an in-order OpenCL queue. My pipeline enqueues multiple kernels into the queue.

    queue = clCreateCommandQueue(cl.context, cl.device, 0, &cl.error);

    for (i = 0; i < num_kernels; i++) {
        clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL, global_work_group_size, local_work_group_size, 0, NULL, &event);
    }

The output of kernels[0] is the input to kernels[1], the output of kernels[1] is the input to kernels[2], and so on.

Since my command queue is an in-order queue, my assumption is kernels[1] will start only after kernels[0] is completed.

  1. Is my assumption valid?
  2. Should I use clWaitForEvents to make sure the previous kernel is completed before enqueuing the next kernel?
  3. Is there any way I can stack multiple kernels into the queue & just pass the input to kernels[0] & directly get the output from the last kernel? (without having to enqueue every kernel one by one)

1 Answer


Your assumption is valid. You do not need to wait for events in an in-order queue. Take a look at the OpenCL doc:

https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html

If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A. If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is set, then there is no guarantee that kernel A will finish before kernel B starts execution.
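In practice this means your loop can drop the per-kernel event entirely and synchronize only once at the end, e.g. with a blocking read. A minimal sketch (using the placeholder names from your question — `out_buf`, `out_size`, `host_out` are assumed to be the final output buffer and its host-side destination, and kernel arguments are assumed to be set already):

```c
/* In-order queue: no event waits are needed between kernels. */
for (cl_uint i = 0; i < num_kernels; i++) {
    cl_int err = clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL,
                                        global_work_group_size,
                                        local_work_group_size,
                                        0, NULL, NULL); /* no event */
    /* check err here ... */
}

/* A blocking read (CL_TRUE) returns only after all preceding commands
   in the in-order queue have finished, so this both synchronizes the
   pipeline and fetches the final result. */
clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, out_size, host_out,
                    0, NULL, NULL);
```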

As to question 3: yes, you'll need to enqueue every kernel you want to run explicitly; OpenCL has no built-in way to submit a whole pipeline as one command. Consider it a good thing, as there is no magic happening.

Of course you can always write your own helpers in C/C++ (or whatever host language you are using) that simplify this, and potentially hide the cumbersome kernel calls. Or use some GPGPU abstraction library to do the same.
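For instance, a tiny helper along these lines (hypothetical, not a standard API — the name and signature are illustrative, and the caller is assumed to have set all kernel arguments) hides the repetitive enqueue calls:

```c
/* Hypothetical helper: enqueue a chain of kernels into an in-order
   queue. Relies on in-order execution for the producer/consumer
   ordering, so no events are used. */
static cl_int enqueue_pipeline(cl_command_queue queue,
                               cl_kernel *kernels, cl_uint num_kernels,
                               cl_uint dims,
                               const size_t *global_size,
                               const size_t *local_size)
{
    for (cl_uint i = 0; i < num_kernels; i++) {
        cl_int err = clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL,
                                            global_size, local_size,
                                            0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err; /* bail out on the first failure */
    }
    return CL_SUCCESS;
}
```

Then the pipeline is a single call, and you only synchronize when you read the last kernel's output buffer back.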