Kernel invoking delay on CUDA with Streams

Question

I have implemented the scan algorithm in CUDA from scratch and am trying to use it on small inputs of less than 80,000 bytes.

I created two versions: one runs the kernels in separate streams where possible, and the other runs everything in the default stream.

What I've observed is that, for this range of data sizes, the version with streams takes longer to complete than the default-stream version.

Profiling with the NVIDIA profiler shows that, for small data sizes, running in streams does not give any parallelism between the separate kernels:

Without Streams

With Streams

But when the data size is increased, some parallelism is obtained:

With Streams for 400,000 bytes

My question is: are there additional parameters that reduce this kernel launch delay, or is it normal for streams to be a disadvantage at small data sizes?

UPDATE :

I've included the Runtime API calls timeline as well to clarify the answer

kangshiyin · Accepted Answer · 2016-06-08T11:51:42

Generally, your data is too small to fully utilize the GPU in your first case. If you check the 'Runtime API' timeline in nvvp, which you did not show in your figures, you will find that launching a kernel takes a few microseconds. If your first kernel in stream 13 is too short, it may finish before the second kernel in stream 14 has even been launched, so there is no parallelism across streams.
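To make that reasoning concrete, here is a rough host-side model in plain C++ (no GPU needed). It is only a sketch: `simulate_launches`, `total_overlap_us` and the timings are hypothetical, and it assumes a single host thread that spends a fixed `launch_us` on each launch call, with one kernel per stream. Real launch costs vary by driver, GPU and kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model, not a CUDA API: one host thread issues one kernel per
// stream, and each launch call occupies the host for launch_us microseconds.
struct KernelSpan {
    double start_us;  // time the kernel begins executing on the GPU
    double end_us;    // time the kernel finishes
};

// Kernel i can start only after the host has finished issuing i + 1 launches.
// The kernels sit in different streams, so nothing else orders them.
std::vector<KernelSpan> simulate_launches(int kernels, double launch_us,
                                          double kernel_us) {
    std::vector<KernelSpan> spans;
    for (int i = 0; i < kernels; ++i) {
        double start = (i + 1) * launch_us;
        spans.push_back({start, start + kernel_us});
    }
    return spans;
}

// Sum of the time each kernel overlaps with the kernel issued just before it.
double total_overlap_us(const std::vector<KernelSpan>& spans) {
    double overlap = 0.0;
    for (std::size_t i = 1; i < spans.size(); ++i) {
        overlap += std::max(0.0, spans[i - 1].end_us - spans[i].start_us);
    }
    return overlap;
}
```

With an assumed 5 us launch cost, `total_overlap_us(simulate_launches(2, 5.0, 3.0))` is 0 because the 3 us kernel is done before the next launch is issued, while a 20 us kernel gives 15 us of overlap. In this model, kernels overlap only when they run longer than the gap between launches, which matches what you see when you increase the data size.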

Because of these overheads, you may find it even faster to run your program on the CPU if the data is small.
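As a baseline to time against, a sequential scan on the host is only a few lines. This is a minimal sketch that assumes an inclusive prefix sum over `int` elements. Your scan may be exclusive or use another type, and `inclusive_scan_cpu` is just an illustrative name.

```cpp
#include <numeric>
#include <vector>

// Sequential inclusive scan (prefix sum) on the host:
// output[i] = input[0] + input[1] + ... + input[i].
std::vector<int> inclusive_scan_cpu(const std::vector<int>& input) {
    std::vector<int> output(input.size());
    std::partial_sum(input.begin(), input.end(), output.begin());
    return output;
}
```

For example, `inclusive_scan_cpu({1, 2, 3, 4})` returns `{1, 3, 6, 10}`. 80,000 bytes is only 20,000 elements if they are 4-byte `int`s, so it is worth timing this against the GPU version, including the host-to-device copies.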
