
I have implemented the scan algorithm in CUDA from scratch and am trying to use it for data amounts smaller than 80,000 bytes.

Two separate instances were created: one runs the kernels using streams where possible, and the other runs only in the default stream.

What I've observed is that, for this range of data sizes, running with streams takes longer to complete the task than the default-stream version.
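For context, here is a minimal sketch of the two setups being compared. The kernel name (`scan_kernel`), signature, and launch configuration are assumptions for illustration, not the question's actual code:

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder for the question's scan kernel.
__global__ void scan_kernel(const float* in, float* out, int n) {
    // ... scan implementation ...
}

// Variant A: both launches go to the default stream, so they serialize.
void run_default_stream(const float* d_in1, float* d_out1,
                        const float* d_in2, float* d_out2,
                        int n, int blocks, int threads) {
    scan_kernel<<<blocks, threads>>>(d_in1, d_out1, n);
    scan_kernel<<<blocks, threads>>>(d_in2, d_out2, n);
    cudaDeviceSynchronize();
}

// Variant B: independent work is issued to separate streams and may overlap.
void run_with_streams(const float* d_in1, float* d_out1,
                      const float* d_in2, float* d_out2,
                      int n, int blocks, int threads) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    scan_kernel<<<blocks, threads, 0, s1>>>(d_in1, d_out1, n);
    scan_kernel<<<blocks, threads, 0, s2>>>(d_in2, d_out2, n);
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

Variant B only wins if each kernel runs long enough for the launches to overlap; for very small inputs the extra stream management is pure overhead.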

When analysed with the NVIDIA profiler (nvprof), what was observed is that for smaller data sizes, running in streams does not provide parallelism for the separate kernels:

[Figure: Scan without streams]

[Figure: Scan with streams]

But when the data size is increased, some degree of parallelism can be obtained:

[Figure: Scan with streams for 400,000 bytes]

My question is: are there additional parameters that can reduce these kernel-launch delays, or is it normal to see this behavior for smaller data sizes, where using streams is disadvantageous?

UPDATE:

I've included the Runtime API calls timeline as well, to clarify the answer.

[Figure: Scan with streams, including the Runtime API timeline]


1 Answer


Generally, your data is too small to fully utilize the GPU in your first case. If you check the 'Runtime API' timeline in nvvp, which you did not show in your figures, you will find that launching a kernel takes a few microseconds. If your first kernel in stream 13 is too short, the second kernel in stream 14 may not have been launched yet, so there is no parallelism across streams.

Because of these overheads, you may find it even faster to run your program on the CPU when the data is small.
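One way to see the per-launch overhead the answer mentions is to time many launches of an empty kernel with CUDA events; this is a standalone sketch, not part of the question's program:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel that does no work, so the measured time is dominated by
// launch overhead rather than execution.
__global__ void empty_kernel() {}

int main() {
    // Warm-up launch to exclude one-time context-creation cost.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    const int iters = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per-launch time: %.2f us\n", ms * 1000.0f / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If your scan kernels finish in a time comparable to this per-launch figure, the launches in different streams cannot overlap, which matches the timelines in the question.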