In the CUDA C Programming Guide, a stream is defined very abstractly: a sequence of CUDA operations that are executed in the order they are issued by the host code.
My understanding of how instructions are executed on an NVIDIA GPU is: when a kernel is launched, its blocks are distributed to the SMs on the device. Then the warps (groups of 32 threads) are scheduled by a warp scheduler within each SM, and instructions are issued warp-wise.
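To make that concrete, here is roughly how I picture a launch (a minimal sketch; the kernel name and sizes are placeholders I made up):

```
// Each block of 128 threads is split into 4 warps of 32 threads.
__global__ void scale(float *data, float factor) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= factor;
}

// 64 blocks get distributed across the SMs; within each SM, the warp
// scheduler picks ready warps and issues their instructions warp-wise.
// (d_data is assumed to be a device pointer allocated elsewhere.)
scale<<<64, 128>>>(d_data, 2.0f);
```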
So, if two kernels are launched in the same stream, the first is processed before the second (since operations are processed in the order they were put into the stream). Does that mean the two kernels end up sharing the hardware resources of a single kernel? Or does each kernel get its own resources, with the second one pending until the first completes?
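In code, the situation I am asking about looks like this (a self-contained sketch; the kernels are placeholders):

```
#include <cuda_runtime.h>

__global__ void kernelA(float *d) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d[idx] += 1.0f;
}

__global__ void kernelB(float *d) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d[idx] *= 2.0f;
}

int main() {
    const int n = 64 * 128;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Same stream: kernelB is guaranteed not to start before kernelA finishes.
    kernelA<<<64, 128, 0, stream>>>(d_data);
    kernelB<<<64, 128, 0, stream>>>(d_data);

    cudaStreamSynchronize(stream); // block host until both kernels complete
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```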
And in general, how are streams implemented in hardware? I assume they impose an ordering on the warp scheduler (but a warp scheduler is per-SM, so how would that allow a kernel whose blocks span multiple SMs to be ordered by a stream?).
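For contrast, the case that makes me wonder what the hardware actually does is two independent streams, where the kernels have no ordering between them and may run concurrently if resources allow (again a sketch with made-up names):

```
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// No ordering guarantee between the two streams: the device may run
// kernelA and kernelB concurrently, each spanning multiple SMs.
// (d_a and d_b are assumed to be device pointers allocated elsewhere.)
kernelA<<<64, 128, 0, s1>>>(d_a);
kernelB<<<64, 128, 0, s2>>>(d_b);

cudaDeviceSynchronize(); // wait for work in all streams
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```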