I have two independent kernels (A and B) that can be executed concurrently. I need kernel A to finish as soon as possible, because its result is needed for an MPI exchange. The obvious option is to launch them in one stream: A first, then B.
However, kernel A has only a few thread blocks, so if I run A and B sequentially, the GPU is not fully utilized while A is running.
Is it possible to execute A and B concurrently with A having higher priority?
I.e., I want thread blocks from kernel B to start executing only when kernel A has no blocks left that have not yet started.
As I understand it, if I launch kernel A in one stream and then, on the very next line of host code, launch kernel B in another stream, there is no guarantee that thread blocks from B will not actually be executed first. Is that correct?
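To make the setup concrete, here is a minimal sketch of the two-stream launch I have in mind. The kernel bodies, the names `kernelA`/`kernelB`, and the problem sizes are placeholders I made up for illustration (my real kernels are different), and the MPI exchange is only marked with a comment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels (hypothetical): the real A and B are independent of each other.
__global__ void kernelA(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

__global__ void kernelB(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f;
}

int main() {
    const int threads = 256;
    const int nA = 4 * threads;      // A: only a few thread blocks (assumed size)
    const int nB = 4096 * threads;   // B: many thread blocks (assumed size)

    float *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dA, nA * sizeof(float));
    cudaMalloc(&dB, nB * sizeof(float));

    // One stream per kernel, so the two launches are allowed to overlap.
    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    // A is launched first, B on the next line in a different stream.
    // Launch order on the host does not by itself fix the order in which
    // the hardware schedules the thread blocks of the two kernels.
    kernelA<<<nA / threads, threads, 0, streamA>>>(dA, nA);
    kernelB<<<nB / threads, threads, 0, streamB>>>(dB, nB);

    // Wait only for A; the MPI exchange of A's result would start here.
    cudaStreamSynchronize(streamA);

    // B may still be running at this point; wait for it before cleanup.
    cudaStreamSynchronize(streamB);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

With this pattern, what I want is for `cudaStreamSynchronize(streamA)` to return as early as possible while B fills the rest of the GPU.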