In my program, I have two kernels, and for each kernel I launch only two blocks of 256 threads.
kernel1<<<2,256>>>();
kernel2<<<2,256>>>();
On a graphics card with 4 SMXs, the program currently executes something like this (when profiling with the Visual Profiler, the two kernels execute one after the other):
  SMX1  SMX2
-------------
|  K1 |  K1 |
|  K1 |  K1 |
-------------
|     |     |
|     |     |
-------------
  SMX3  SMX4

  SMX1  SMX2
-------------
|  K2 |  K2 |
|  K2 |  K2 |
-------------
|     |     |
|     |     |
-------------
  SMX3  SMX4
I was wondering whether it is possible to launch both kernels at the same time within the same program, so that the execution looks like this and the execution time is roughly divided by 2:
  SMX1  SMX2
-------------
|  K1 |  K1 |
|  K1 |  K1 |
-------------
|  K2 |  K2 |
|  K2 |  K2 |
-------------
  SMX3  SMX4
On Kepler (compute capability 3.5), the new "Hyper-Q" feature allows multiple kernels to be launched simultaneously from multiple MPI processes (or other processes).
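To make the single-program case concrete, here is a minimal sketch of what I have in mind, assuming that CUDA streams are the right mechanism for this. The kernel bodies, the helper name launchBoth, and the buffers d_a and d_b are placeholders I made up for illustration. Whether the two kernels actually overlap depends on the device and on available resources, so this only requests concurrency and does not guarantee it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels: the real kernel1/kernel2 bodies are not part of the question.
__global__ void kernel1(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 1.0f;
}

__global__ void kernel2(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f;
}

// Hypothetical helper: launches each kernel in its own non-default stream
// so the hardware is allowed (not forced) to run them concurrently.
cudaError_t launchBoth(float *d_a, float *d_b) {
    cudaStream_t s1, s2;
    cudaError_t err = cudaStreamCreate(&s1);
    if (err != cudaSuccess) return err;
    err = cudaStreamCreate(&s2);
    if (err != cudaSuccess) { cudaStreamDestroy(s1); return err; }

    // Fourth launch parameter selects the stream; third is dynamic shared memory.
    kernel1<<<2, 256, 0, s1>>>(d_a);
    kernel2<<<2, 256, 0, s2>>>(d_b);

    err = cudaGetLastError();                      // launch errors
    cudaError_t syncErr = cudaDeviceSynchronize(); // wait for both streams
    if (err == cudaSuccess) err = syncErr;

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return err;
}

int main() {
    const int n = 2 * 256;  // 2 blocks of 256 threads, one element per thread
    float *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_b, n * sizeof(float)) != cudaSuccess) { cudaFree(d_a); return 1; }

    cudaError_t err = launchBoth(d_a, d_b);
    if (err != cudaSuccess)
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_a);
    cudaFree(d_b);
    return err == cudaSuccess ? 0 : 1;
}
```

With kernels this small, any overlap may be hard to see in the profiler timeline, so the placeholder bodies would need to do more work before the two launches visibly run side by side.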