For my CUDA development, I am using a machine with 16 cores and one GTX 580 GPU with 16 SMs. For the work I am doing, I plan to launch 16 host threads (one per core), with one kernel launch per thread, each consisting of 1 block and 1024 threads. My goal is to run 16 kernels in parallel, one on each of the 16 SMs. Is this possible/feasible?
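Roughly, the launch path I have in mind looks like the sketch below, using pthreads on the host (`myKernel` and the per-thread device buffers are just placeholders for my actual code):

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    // each of the 1024 threads in the single block works on one element
    int i = threadIdx.x;
    data[i] *= 2.0f;
}

// one host thread per core; each launches its own 1-block kernel
void *worker(void *arg)
{
    float *d_data = (float *)arg;      // device buffer for this host thread
    myKernel<<<1, 1024>>>(d_data);     // 1 block of 1024 threads
    cudaDeviceSynchronize();           // wait for this thread's kernel
    return NULL;
}

int main(void)
{
    pthread_t threads[16];
    float *d_bufs[16];
    for (int t = 0; t < 16; ++t) {
        cudaMalloc(&d_bufs[t], 1024 * sizeof(float));
        pthread_create(&threads[t], NULL, worker, d_bufs[t]);
    }
    for (int t = 0; t < 16; ++t)
        pthread_join(threads[t], NULL);
    return 0;
}
```

The question is whether the 16 kernels launched this way would actually execute concurrently, one per SM.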
I have tried to read as much as possible about independent contexts, but there does not seem to be much information available. As I understand it, each host thread can have its own GPU context. However, I am not sure whether the kernels will run in parallel if I use independent contexts.
I could read all the data from all 16 host threads into one giant structure and pass it to the GPU in a single kernel launch. However, that would involve too much copying and would slow down the application.