
I'm working on translating a CUDA application (this if you must know) to OpenCL. The original application uses the C-style CUDA API, with a single stream just to avoid the automatic busy-wait when reading the results.

Now I notice that OpenCL command queues look a lot like CUDA streams. But the device read command, and likewise the write and kernel-execution commands, also take event parameters. So I'm wondering: what does it take to execute a device write, a number of kernel calls (e.g. one call to one kernel, then 100 calls to another kernel), and a device read, all sequentially?

  1. If I just enqueue them sequentially into the same queue, will they execute sequentially like they do in CUDA?
  2. If that doesn't work, can/should I daisy-chain events, making each call's wait list the previous call's event?
  3. Or should I add all previous events to each call's wait list, as if there's an N^2 search for dependencies or something?
  4. Or do I just have to event.wait() for each call individually, like it says to in AMD's tutorial?

Thanks!


1 Answer


That depends on how you create the command queue. In clCreateCommandQueue there's a properties parameter that can contain CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, which enables non-sequential execution in the command queue.

If that property is set, commands may execute out of order or in parallel, and the only way to synchronize them is with events.
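For instance, with an out-of-order queue, the write/kernel/read chain from your question has to carry its dependencies explicitly through the event wait lists. A rough sketch of that (the queue, kernel, and buffer names are placeholders I'm assuming were created elsewhere):

    #include <CL/cl.h>

    /* Assumed to already exist elsewhere -- illustrative names only. */
    extern cl_command_queue ooo_queue;  /* created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE */
    extern cl_kernel        kernelA;
    extern cl_mem           buf;
    extern size_t           global_size;

    void run_out_of_order(const float *host_in, float *host_out, size_t bytes)
    {
        cl_event wrote, ran;

        /* Non-blocking write; it signals `wrote` when the copy is done. */
        clEnqueueWriteBuffer(ooo_queue, buf, CL_FALSE, 0, bytes, host_in,
                             0, NULL, &wrote);

        /* Kernel waits on the write's event and signals its own. */
        clEnqueueNDRangeKernel(ooo_queue, kernelA, 1, NULL, &global_size, NULL,
                               1, &wrote, &ran);

        /* Blocking read that only starts after the kernel's event completes. */
        clEnqueueReadBuffer(ooo_queue, buf, CL_TRUE, 0, bytes, host_out,
                            1, &ran, NULL);

        clReleaseEvent(wrote);
        clReleaseEvent(ran);
    }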

When that property is not set, commands execute sequentially in the queue.
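So for the sequence in your question (one write, ~101 kernel launches, one read), the default in-order queue already gives you CUDA-stream-like behavior: no events are needed, and a final blocking read is enough to know the results are ready. A minimal sketch under those assumptions (the context, device, kernels, and buffer are assumed to be set up elsewhere; the names are made up):

    #include <CL/cl.h>

    /* Assumed to be created elsewhere -- illustrative names only. */
    extern cl_context   ctx;
    extern cl_device_id dev;
    extern cl_kernel    kernelA, kernelB;
    extern cl_mem       buf;
    extern size_t       global_size;

    void run_in_order(const float *host_in, float *host_out, size_t bytes)
    {
        cl_int err;

        /* properties == 0: an in-order queue, like a single CUDA stream. */
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        /* No wait lists or events -- the queue itself serializes the commands. */
        clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes, host_in, 0, NULL, NULL);

        clEnqueueNDRangeKernel(q, kernelA, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        for (int i = 0; i < 100; ++i)
            clEnqueueNDRangeKernel(q, kernelB, 1, NULL, &global_size, NULL, 0, NULL, NULL);

        /* Blocking read: returns only after everything enqueued before it has run. */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);

        clReleaseCommandQueue(q);
    }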