I've got an NVIDIA GT 650M with the following properties:
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
Maximum number of threads per multiprocessor: 2048
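(For reference, these figures come straight from the CUDA device properties; a minimal sketch of how I read them, assuming device 0:)

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0

        printf("Multiprocessors: %d\n", prop.multiProcessorCount);            // 2 on my GT 650M
        printf("Max threads per MP: %d\n", prop.maxThreadsPerMultiProcessor); // 2048
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);       // 1024
        return 0;
    }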
I've just run into some confusion between streaming multiprocessors (SMs) and the actual multiprocessors. SMs and multiprocessors are different things, right? For example, using the visual profiler, I've got a dummy kernel that only waits, and it lasts 370 ms when launched with 1 block of 1 thread. I can launch it with 4 blocks of 1024 threads with one SM, and it still lasts 370 ms. This is normal because the task uses the 2 multiprocessors of the chip, each one running 2048 concurrent threads (as soon as I use 5 blocks × 1024, it takes 740 ms, which is expected). Similarly, I can concurrently launch one block of 1024 threads four times using 4 SMs, and it still takes 370 ms, OK.
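In case it helps, here is roughly what my test looks like (simplified, names are mine; I'm assuming the four concurrent launches go through separate CUDA streams, since that's the only mechanism I know of for running kernels side by side, and timings are taken from the visual profiler):

    #include <cuda_runtime.h>

    // Busy-wait a fixed number of clock cycles, so every thread holds
    // its slot on a multiprocessor for the same amount of time.
    __global__ void wait_kernel(long long cycles) {
        long long start = clock64();
        while (clock64() - start < cycles) { /* spin */ }
    }

    int main() {
        const long long cycles = 300000000LL;  // tuned so one launch lasts ~370 ms

        // Case 1: one launch of 4 blocks x 1024 threads (fills both MPs: 2 x 2048).
        wait_kernel<<<4, 1024>>>(cycles);
        cudaDeviceSynchronize();

        // Case 2: four concurrent launches of 1 block x 1024 threads,
        // one per stream; same 370 ms overall.
        cudaStream_t s[4];
        for (int i = 0; i < 4; ++i) cudaStreamCreate(&s[i]);
        for (int i = 0; i < 4; ++i) wait_kernel<<<1, 1024, 0, s[i]>>>(cycles);
        cudaDeviceSynchronize();
        for (int i = 0; i < 4; ++i) cudaStreamDestroy(s[i]);
        return 0;
    }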
This first part of the question was just to make sure that we shouldn't confuse SMs and multiprocessors, as I sometimes see done even in answers, like here: CUDA - Multiprocessors, Warp size and Maximum Threads Per Block: What is the exact relationship? As a consequence, one cannot explicitly control how tasks are scheduled across the multiprocessors, because (as far as I know) no runtime function permits it, right? So, if I have a card with 2 multiprocessors and 2048 threads per multiprocessor, and another one with 4 multiprocessors of 1024 threads each, a given program will execute the same way on both?
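(The closest I've found to seeing where blocks go is reading the %smid special register from inside the kernel, which is pure observation after the fact, not control; a sketch, assuming this PTX register is the right one:)

    #include <cstdio>

    // %smid holds the ID of the multiprocessor the current block is
    // running on. It can be read, but never chosen: the hardware
    // scheduler alone decides block placement.
    __device__ unsigned int smid() {
        unsigned int id;
        asm("mov.u32 %0, %%smid;" : "=r"(id));
        return id;
    }

    __global__ void report_sm() {
        if (threadIdx.x == 0)
            printf("block %d runs on multiprocessor %u\n", blockIdx.x, smid());
    }

    int main() {
        report_sm<<<4, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }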
Secondly, I wanted to know which configuration is better for which usage: more multiprocessors with fewer cores each, or the reverse? So far, my understanding makes me say that more multiprocessors (for a given maximum number of threads per multiprocessor) with fewer cores would be better suited to massive parallelism with few/simple operations per thread, while more cores per multiprocessor (now I'm talking about things I barely know) would mean more dedicated ALUs for load/store operations and complex mathematical functions, so it would be better suited to kernels requiring more operations per thread?
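(If relevant: the only programmatic handle I know of on "how much fits on one multiprocessor" is the occupancy API of recent CUDA toolkits; a sketch, with a dummy kernel standing in for a real one:)

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy() {}

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blocksPerMP = 0;
        // How many 1024-thread blocks of this kernel can be resident
        // on one multiprocessor at once, given its resource limits.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerMP, dummy, 1024, 0);

        printf("Resident blocks per MP: %d\n", blocksPerMP);
        printf("Theoretical concurrent threads: %d\n",
               blocksPerMP * 1024 * prop.multiProcessorCount);
        return 0;
    }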