
This question is about the notion of a CUDA stream (Stream) and an apparent anomaly with CUDA events (Event) recorded on a stream.

Consider the following code, which demonstrates this anomaly:

cudaEventRecord(eventStart, stream1);
kernel1<<<..., stream1>>>(...);
cudaEventRecord(eventBetween, stream1);
kernel2<<<..., stream1>>>(...);
cudaEventRecord(eventEnd, stream1);

Since all work on a Stream is sequential, the following sequence of events should occur:

  1. "eventStart" is recorded to be complete
  2. kernel1 runs and completes
  3. "eventBetween" is recorded to have completed
  4. kernel2 runs and completes
  5. "eventEnd" is recorded to have completed

After synchronizing the host thread with the device, the time elapsed between eventStart and eventBetween (measured by cudaEventElapsedTime) should be the run time of kernel1, and the time elapsed between eventBetween and eventEnd should be the run time of kernel2.
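A minimal sketch of this measurement, under some assumptions: the kernel bodies, the buffer d, and the launch configuration are hypothetical placeholders (the real kernels are not shown here), and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; they only exist to give the events
// something to time.
__global__ void kernel1(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void kernel2(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]);
}

int main() {
    const int n = 1 << 20;  // assumed problem size
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    cudaEvent_t eventStart, eventBetween, eventEnd;
    cudaEventCreate(&eventStart);
    cudaEventCreate(&eventBetween);
    cudaEventCreate(&eventEnd);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Everything below is enqueued on stream1 in issue order.
    cudaEventRecord(eventStart, stream1);
    kernel1<<<blocks, threads, 0, stream1>>>(d, n);
    cudaEventRecord(eventBetween, stream1);
    kernel2<<<blocks, threads, 0, stream1>>>(d, n);
    cudaEventRecord(eventEnd, stream1);

    // Block the host until the last event on the stream has completed.
    cudaEventSynchronize(eventEnd);

    float ms1 = 0.0f, ms2 = 0.0f;
    cudaEventElapsedTime(&ms1, eventStart, eventBetween);  // kernel1
    cudaEventElapsedTime(&ms2, eventBetween, eventEnd);    // kernel2
    printf("kernel1: %.3f ms, kernel2: %.3f ms\n", ms1, ms2);

    cudaEventDestroy(eventStart);
    cudaEventDestroy(eventBetween);
    cudaEventDestroy(eventEnd);
    cudaStreamDestroy(stream1);
    cudaFree(d);
    return 0;
}
```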

But according to NVIDIA's documentation on cudaStreamWaitEvent,

Makes all future work submitted to stream wait until event reports completion before beginning execution.

and a blog post,

You can also synchronize operations within a single stream on a specific event using cudaStreamWaitEvent(event)

These statements suggest that an Event need not report completion before work scheduled after it on the same Stream begins executing. This comes as a surprise, since all work scheduled on a Stream is supposed to run sequentially.

Can someone clarify this asynchronous behavior of Events within a Stream?


1 Answer


It appears the cause of your confusion is conflating host-stream synchronisation and stream-stream synchronisation.

In short:

  1. Streams are FIFO, and all operations within a stream are synchronous with respect to one another.
  2. The documentation you are referring to is describing a use case where the programmer wishes to synchronise work in one stream with an event recorded in a different stream.
  3. The blog post you refer to is discussing synchronisation between the calling host thread and an event enqueued in a stream. There are three levels of host-device synchronisation granularity available: the whole device, a whole stream, or an event within a stream. The blog is describing the last of these.
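To make the distinction concrete, here is a rough sketch that uses both kinds of synchronisation side by side. The kernels producer and consumer, the streams s1 and s2, and the event produced are hypothetical names chosen for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels: consumer must not start before producer finishes.
__global__ void producer(int *x) { *x = 42; }
__global__ void consumer(const int *x, int *y) { *y = *x + 1; }

int main() {
    int *x = nullptr, *y = nullptr;
    cudaMalloc(&x, sizeof(int));
    cudaMalloc(&y, sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Timing is not needed for this event, so it is disabled.
    cudaEvent_t produced;
    cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);

    producer<<<1, 1, 0, s1>>>(x);
    cudaEventRecord(produced, s1);

    // Stream-stream synchronisation: work submitted to s2 after this call
    // waits for `produced` to complete. The host does not block here.
    cudaStreamWaitEvent(s2, produced, 0);
    consumer<<<1, 1, 0, s2>>>(x, y);

    // Host-device synchronisation, from finest to coarsest granularity.
    // Any one of these blocks the calling host thread.
    cudaEventSynchronize(produced);  // wait for one event in a stream
    cudaStreamSynchronize(s2);       // wait for all work in one stream
    cudaDeviceSynchronize();         // wait for all work on the device

    int result = 0;
    cudaMemcpy(&result, y, sizeof(int), cudaMemcpyDeviceToHost);
    printf("y = %d\n", result);

    cudaEventDestroy(produced);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Within s1 alone no cudaStreamWaitEvent call is needed, because operations on one stream already execute in issue order.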

There is no contradiction between any of the sources you cited.