I'm currently playing with the new CUDA Dynamic Parallelism (CDP) feature introduced in CUDA 5.0. I picked the N-queens puzzle as an example for tree search algorithms with high work-imbalance, which, in my opinion, may benefit from CDP.
My approach is roughly as follows: For a given board configuration (a chess board with a certain numbers of queens already placed in the first rows) I start a kernel with a number of threads. Each thread tries one possible branch of the sub-tree below the given configuration up to a given max. depth. If the leave of the branch still represents a valid configuration, that thread spawns a child grid of threads that search the sub-tree that is then based on that configuration. Threads that find their configuration to be invalid (two or more queens could attack each other) will terminate. If a thread successfully placed the last queen on the board it increments the solution counter.
Before launching the kernels I pre-calculate some board configurations on the CPU and then launch a grid for each of those configuration.
Now to the problem: I found my solution to be significantly slower then another CUDA implementation that does not use CDP. So I started the Nsight profiler to find the reason. Here is my first result (for N=10):
Obviously the GPU is not fully occupied. So I figured that I need to use different streams for launching the child grids in order to prevent them from waiting on each other. Here is the profiling result when using a new stream for each child grid launch:
This looks more dense (and is faster), but I still don't quite understand the reason for this pattern. Why are there so many gabs between some launches (especially in the end)?
But it gets even stranger. When I increment the N (and thus the workload) to 13, the pattern looks as follows:
Does someone know how CDP works internally? Are there any implicit synchronization barriers I did not consider yet? Or am I reading the profiler output wrong? I'm particularly curious what this one thread spanning over almost the entire execution time in the last case could be.
I also didn't find any documentation for the Nsight Visual Profiler concerning the output for CDP. Any good references about what Nsight is showing there would help as well.
Thanks!