If you have multiple consecutive CUDA operations (in a single stream) that you'd like to time with events (e.g. a cudaMemcpy followed by a kernel launch followed by another cudaMemcpy), is it safe/proper/accurate to synchronize only on the last event? For example:
cudaEventRecord(event1_start);
// do something
cudaEventRecord(event1_stop);
cudaEventRecord(event2_start);
// do something else
cudaEventRecord(event2_stop);
cudaEventSynchronize(event2_stop);  // block the CPU only on the last event
cudaEventElapsedTime(&time1, event1_start, event1_stop);  // elapsed times are in milliseconds
cudaEventElapsedTime(&time2, event2_start, event2_stop);
My understanding is that the event records and the actual CUDA calls are placed into a FIFO queue for the stream, so the CPU only needs to wait until the last event has completed before reading the elapsed times for all of them. Is this correct?
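To make the question concrete, here is a minimal sketch of the full pattern I have in mind. Everything not in the snippet above is a placeholder of mine: the `scale` kernel, the buffer size, the `CHECK` macro and the `timeTwoPhases` helper are all hypothetical, and everything runs in the default stream. Since `cudaEventElapsedTime` returns `cudaErrorNotReady` when either event has not completed yet, checking its return code should show whether the single synchronize was enough.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; stands in for "do something else".
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper macro: abort with a message on any CUDA runtime error.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Times a host-to-device copy and a kernel launch in the default stream,
// synchronizing only on the final event. Results are in milliseconds.
void timeTwoPhases(float *copy_ms, float *kernel_ms) {
    const int n = 1 << 20;  // assumed buffer size, chosen arbitrarily
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    CHECK(cudaMalloc(&dev, n * sizeof(float)));

    cudaEvent_t e1_start, e1_stop, e2_start, e2_stop;
    CHECK(cudaEventCreate(&e1_start));
    CHECK(cudaEventCreate(&e1_stop));
    CHECK(cudaEventCreate(&e2_start));
    CHECK(cudaEventCreate(&e2_stop));

    // Phase 1: the copy, bracketed by two events.
    CHECK(cudaEventRecord(e1_start));
    CHECK(cudaMemcpy(dev, host.data(), n * sizeof(float),
                     cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(e1_stop));

    // Phase 2: the kernel, bracketed by two more events.
    CHECK(cudaEventRecord(e2_start));
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    CHECK(cudaGetLastError());  // catches launch-configuration errors
    CHECK(cudaEventRecord(e2_stop));

    // The only explicit wait: block until the last event has completed.
    CHECK(cudaEventSynchronize(e2_stop));

    // Either call would fail with cudaErrorNotReady (and CHECK would abort)
    // if an event involved had not completed by this point.
    CHECK(cudaEventElapsedTime(copy_ms, e1_start, e1_stop));
    CHECK(cudaEventElapsedTime(kernel_ms, e2_start, e2_stop));

    CHECK(cudaEventDestroy(e1_start));
    CHECK(cudaEventDestroy(e1_stop));
    CHECK(cudaEventDestroy(e2_start));
    CHECK(cudaEventDestroy(e2_stop));
    CHECK(cudaFree(dev));
}

int main() {
    float copy_ms = 0.0f, kernel_ms = 0.0f;
    timeTwoPhases(&copy_ms, &kernel_ms);
    printf("copy: %.3f ms, kernel: %.3f ms\n", copy_ms, kernel_ms);
    return 0;
}
```

I used the blocking cudaMemcpy here to match my example; with cudaMemcpyAsync or a non-default stream I would pass the same stream to every cudaEventRecord call so that everything stays in one queue.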
Thanks!