0
votes

In my program, I have two kernels, and for each kernel I launch only two blocks of 256 threads.

kernel1<<<2,256>>>();
kernel2<<<2,256>>>();

On a graphics card with 4 SMXs, the program currently executes something like this (when profiling with the Visual Profiler, the two kernels execute one after the other):

 SMX1 SMX2
 ---------
| K1 | K1 |  
| K1 | K1 |
-----------
|    |    |
|    |    |
 ---------
 SMX3 SMX4

 SMX1 SMX2
 ---------
| K2 | K2 |  
| K2 | K2 |
-----------
|    |    |
|    |    |
 ---------
 SMX3 SMX4

I was wondering whether it is possible to launch both kernels at the same time within the same program, so that the execution looks like this and the execution time is divided by 2:

 SMX1 SMX2
 ---------
| K1 | K1 |  
| K1 | K1 |
-----------
| K2 | K2 |
| K2 | K2 |
 ---------
 SMX3 SMX4

On Kepler (compute capability 3.5), the new "Hyper-Q" feature allows multiple kernels to be launched simultaneously from multiple MPI processes (or other processes).

1
You may find your answer using the device partitioning feature in OpenCL. - lashgar

1 Answer

2
votes

It should be possible to get the 2 kernels executing simultaneously. For starters, you will need to launch the two kernels in separate streams. Whether it divides the execution time by 2, I can't say. You may want to look at the concurrent kernels CUDA sample or any of the samples that involve streams.
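
As a rough sketch of the separate-streams approach, something like the following could be a starting point. The kernel bodies, the buffer names, and the helper launchInSeparateStreams are placeholders I made up for illustration, since the question does not show what kernel1 and kernel2 actually do. Note that putting the kernels in different streams only allows them to overlap. Whether they really run at the same time depends on the device (see the concurrentKernels field of cudaDeviceProp) and on the resources each kernel needs, and the hardware scheduler, not your code, decides which SMX each block lands on.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels standing in for the question's kernel1 and kernel2.
__global__ void kernel1(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

__global__ void kernel2(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f;
}

// Hypothetical helper: launches both kernels in separate non-default streams
// and checks their output. Returns true if every CUDA call succeeded and the
// results are as expected.
bool launchInSeparateStreams() {
    const int n = 2 * 256;                 // 2 blocks of 256 threads, as in the question
    const size_t bytes = n * sizeof(float);
    float *d1 = nullptr, *d2 = nullptr;
    float h1[n], h2[n];
    cudaStream_t s1, s2;

    if (cudaMalloc(&d1, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d2, bytes) != cudaSuccess) { cudaFree(d1); return false; }
    if (cudaStreamCreate(&s1) != cudaSuccess) { cudaFree(d1); cudaFree(d2); return false; }
    if (cudaStreamCreate(&s2) != cudaSuccess) {
        cudaStreamDestroy(s1); cudaFree(d1); cudaFree(d2); return false;
    }

    // The fourth launch parameter selects the stream. Kernels in different
    // non-default streams are allowed (but not guaranteed) to overlap.
    kernel1<<<2, 256, 0, s1>>>(d1, n);
    kernel2<<<2, 256, 0, s2>>>(d2, n);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // Wait for both streams before reading the results back.
    ok = ok && (cudaStreamSynchronize(s1) == cudaSuccess);
    ok = ok && (cudaStreamSynchronize(s2) == cudaSuccess);
    ok = ok && (cudaMemcpy(h1, d1, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
    ok = ok && (cudaMemcpy(h2, d2, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);

    // Verify that each kernel wrote its own marker value.
    for (int i = 0; ok && i < n; ++i) {
        if (h1[i] != 1.0f || h2[i] != 2.0f) ok = false;
    }

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d1);
    cudaFree(d2);
    return ok;
}
```

Call launchInSeparateStreams() from your main and look at the timeline in the profiler: if the two kernels overlap, they will show up side by side in their respective stream rows instead of one after the other. With kernels this small the overlap may be hard to see, so it can help to test with kernels that run for longer.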