CUDA streams and cudaMemcpyAsync, as far as I know, require us to assign different kernels and memory operations to different streams in order to make GPU operations concurrent with CPU operations.
But is it possible to have one persistent kernel instead? This kernel would be launched once and loop forever, checking some flags to see whether a piece of data has arrived from the CPU, and then operating on it. When that piece of data is done, the GPU sets a flag for the CPU; the CPU sees it and copies the result back. The kernel never finishes running.
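Roughly what I have in mind, as an untested sketch (all names are my own; it assumes mapped pinned "zero-copy" memory, a 64-bit platform with unified addressing so the host pointer is usable on the device, and a single persistent block):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Shared "mailbox" polled by both sides; volatile so neither side caches the flags.
struct Mailbox {
    volatile int cpu_ready;   // CPU sets to 1 after writing `data`
    volatile int gpu_done;    // GPU sets to 1 after writing `result`
    volatile int quit;        // CPU sets to 1 to let the kernel exit
    float data;
    float result;
};

__global__ void persistentKernel(Mailbox *m)
{
    __shared__ int quit;                      // one consistent value for the whole block
    while (true) {
        if (threadIdx.x == 0) {
            while (!m->cpu_ready && !m->quit) { /* spin, waiting for the CPU */ }
            quit = m->quit;
        }
        __syncthreads();
        if (quit) return;

        if (threadIdx.x == 0) {
            m->result = m->data * 2.0f;       // placeholder for the real work
            m->cpu_ready = 0;
            __threadfence_system();           // make the result visible to the host first
            m->gpu_done = 1;                  // then signal the CPU
        }
        __syncthreads();
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);    // allow mapped pinned allocations
    Mailbox *m;
    cudaHostAlloc(&m, sizeof(Mailbox), cudaHostAllocMapped);
    m->cpu_ready = m->gpu_done = m->quit = 0;

    persistentKernel<<<1, 32>>>(m);           // launched once, never relaunched

    m->data = 21.0f;
    __sync_synchronize();                     // GCC/Clang fence: data before flag
    m->cpu_ready = 1;

    while (!m->gpu_done) { /* spin on the host */ }
    printf("result = %f\n", m->result);

    m->quit = 1;                              // tell the kernel to exit
    cudaDeviceSynchronize();
    cudaFreeHost(m);
    return 0;
}
```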
Does this exist in the current CUDA programming model? What is the closest to this I can get?