I'm trying to optimize my CUDA programm by using the Parallel Nsight 2.1 edition for VS 2010.
My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.
One cycle in the program consits of a copy of host data to the device, execution of the kernels and copy of the results from the device to the host.
As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernel in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before launch of the Kernels in the different streams.
What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 us to launch the kernel which is a huge overhead in a processing cycle of less than 1 ms.
Furthermore there is no overlapping of kernel execution and the data copy of the results back to the host, which increases the overhead even more.
Are there any obvious reasons for this behavior?