4 votes

I'm trying to optimize my CUDA program using Parallel Nsight 2.1 for Visual Studio 2010.

My program runs on a Windows 7 (32-bit) machine with a GTX 480 card. I have installed the CUDA 4.1 32-bit toolkit and the 301.32 driver.

One cycle in the program consists of copying host data to the device, executing the kernels, and copying the results from the device back to the host.

As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernel in each stream relies on the data copied to the device in 'Stream 2'. That's why the async memcpy is synchronized with the CPU before the kernels are launched in the different streams.

(screenshot: Parallel Nsight timeline of one processing cycle)
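In outline, one cycle looks roughly like this (a simplified sketch only; the kernel, function, and buffer names are placeholders and the real kernels are more complex):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; the real kernels are more complex.
__global__ void processKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// One processing cycle (pinned host and device buffers are allocated elsewhere).
void runCycle(cudaStream_t copyStream,          // 'Stream 2' in the trace
              cudaStream_t streams[4],
              const float *h_in, float *d_in,
              float *d_out[4], float *h_out[4],
              int n)
{
    size_t bytes = n * sizeof(float);

    // Upload the shared input data, then wait so every stream sees it.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copyStream);
    cudaStreamSynchronize(copyStream);

    // Launch the kernels in four streams and copy each result back.
    for (int s = 0; s < 4; ++s) {
        processKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in, d_out[s], n);
        cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
}
```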

What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the start of the kernel execution (at 10.5781500). It takes around 300 µs to launch the kernel, which is a huge overhead in a processing cycle of less than 1 ms.

Furthermore, there is no overlap between kernel execution and copying the results back to the host, which increases the overhead even more.

Are there any obvious reasons for this behavior?


1 Answer

6 votes

There are three issues that I can identify from the trace.

  1. Nsight CUDA Analysis adds about 1 µs per API call. You have both the CUDA runtime and the CUDA Driver API trace enabled. If you were to disable CUDA runtime trace, I would guess that you would reduce the width by 50 µs.

  2. Since you are on a GTX 480 on Windows 7, you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work, which introduces a lot of overhead. To reduce this overhead, the CUDA driver buffers requests in an internal software queue and sends the requests to the driver when the queue is full or when it is flushed by a synchronize call. It is possible to use cudaEventQuery to force the driver to flush the work, but this can have other performance implications (a sketch of this appears after the list).

  3. It appears you are submitting your work to the streams in a depth-first manner. On compute capability 2.x and 3.0 devices you will get better results if you submit work to the streams in a breadth-first manner; in your case you may then see overlap between your kernels (see the breadth-first sketch after the list).
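As a rough illustration of the cudaEventQuery flush mentioned in point 2, here is a minimal sketch; the kernel, buffer sizes, and stream are made up, not taken from your program:

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel.
__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t flushEvent;
    cudaEventCreateWithFlags(&flushEvent, cudaEventDisableTiming);

    // On WDDM the launch is initially only queued in the driver's software queue.
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaEventRecord(flushEvent, stream);

    // A non-blocking query nudges the driver to submit the queued work
    // to the GPU without stalling the CPU.
    cudaEventQuery(flushEvent);

    // ... CPU work can continue here while the GPU executes the kernel ...

    cudaStreamSynchronize(stream);

    cudaEventDestroy(flushEvent);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```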
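To make the breadth-first suggestion in point 3 concrete, here is a sketch of one cycle reorganized so each stage is issued across all four streams before the next stage; the kernel, function, and buffer names are placeholders, not from your actual code:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel, as in the question's sketch.
__global__ void processKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void runCycleBreadthFirst(cudaStream_t copyStream,
                          cudaStream_t streams[4],
                          const float *h_in, float *d_in,
                          float *d_out[4], float *h_out[4],
                          int n)
{
    size_t bytes = n * sizeof(float);

    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copyStream);
    cudaStreamSynchronize(copyStream);

    // Stage 1: issue all kernel launches first so they can overlap on the device.
    for (int s = 0; s < 4; ++s)
        processKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in, d_out[s], n);

    // Stage 2: then queue the device-to-host copies; each copy can start as soon
    // as its stream's kernel finishes and overlap with the remaining kernels.
    for (int s = 0; s < 4; ++s)
        cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, streams[s]);

    // One device-wide wait instead of four per-stream waits.
    cudaDeviceSynchronize();
}
```

The reason this helps on Fermi-class devices is that there is a single hardware queue for kernel launches, so interleaving other work between launches from different streams can create false dependencies; issuing all of the launches first avoids that.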

The timeline screenshot does not provide sufficient information for me to determine why the memory copies start only after all of the kernels have completed. Given the API call pattern, you should be able to see transfers starting after each stream completes its launches.

If you are waiting on all streams to complete, it is likely faster to do a single cudaDeviceSynchronize call than 4 cudaStreamSynchronize calls.
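For illustration, the two ways of waiting at the end of a cycle might look like this (a sketch; the four-stream setup is assumed from the question):

```cuda
#include <cuda_runtime.h>

// Option A: one synchronize per stream (four kernel-mode round trips on WDDM).
void waitPerStream(cudaStream_t streams[4])
{
    for (int s = 0; s < 4; ++s)
        cudaStreamSynchronize(streams[s]);
}

// Option B: a single device-wide synchronize, likely cheaper in this case.
void waitWholeDevice()
{
    cudaDeviceSynchronize();
}
```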

The next version of Nsight will have additional features to help you understand the software queuing and the submission of work to the compute engine and the memory copy engine.