Each block deployed to an SM for execution requires certain resources, namely registers and shared memory. Let's imagine the following situation:
- each thread of a certain kernel uses 64 32-bit registers (256 B of register memory),
- the kernel is launched with blocks of 1024 threads,
- such a block would therefore consume 256 * 1024 B = 256 kB of registers on the SM it runs on.
I don't know about Tegra, but on the card I am using now (GK110 chip), every SM has 65536 32-bit registers (~256 kB) available. In this scenario a single block would therefore use all of the registers of the SM it is deployed to, so the limit would be 1 block per SM.
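The register arithmetic above can be written down as a small sketch. This is plain Python, not a CUDA API call; the function name is made up for illustration, and it assumes registers are the only limit and ignores the allocation granularity real hardware applies.

```python
# Sketch: blocks per SM when registers are the only limiting resource.
# The figures are the ones from the example above; the helper name is
# hypothetical and register allocation granularity is ignored.

REGISTERS_PER_SM = 65536  # 32-bit registers per SM on GK110


def blocks_limited_by_registers(regs_per_thread, threads_per_block,
                                regs_per_sm=REGISTERS_PER_SM):
    """Return how many whole blocks fit into the SM's register file."""
    regs_per_block = regs_per_thread * threads_per_block
    return regs_per_sm // regs_per_block


# 64 registers per thread, 1024 threads per block: one block fills the SM.
print(blocks_limited_by_registers(64, 1024))
# Halving the register use per thread leaves room for a second block.
print(blocks_limited_by_registers(32, 1024))
```

With 64 registers per thread the block needs exactly 65536 registers, which is why nothing else fits next to it.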
The example with shared memory works the same way. In the kernel launch parameters you can define the amount of shared memory used by each launched block, so if you set it to 32 kB and the SM has 64 kB of shared memory, two blocks can be deployed to that SM. It is worth mentioning that, as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
I am not sure at the moment whether there is any blocking factor other than registers and shared memory, but if the blocking factor for registers is 1 and for shared memory is 2, then the lower number is the limit on the number of blocks per SM.
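As a rough illustration of "the lower number wins", here is a minimal Python sketch that combines the register limit and the shared memory limit. The function name and its defaults are my own assumptions: 65536 registers per SM as on GK110, and the 64 kB of shared memory per SM used in the example above. Any other per-SM limits are deliberately left out.

```python
# Sketch: the block limit per SM is the smallest of the per-resource limits.
# Defaults follow the example in the text (65536 registers, 64 kB shared
# memory per SM); the helper is hypothetical and ignores all other limits.

def blocks_per_sm(regs_per_thread, threads_per_block, smem_per_block,
                  regs_per_sm=65536, smem_per_sm=64 * 1024):
    """Return the number of blocks that fit on one SM."""
    limits = [regs_per_sm // (regs_per_thread * threads_per_block)]
    if smem_per_block > 0:
        # Shared memory only constrains kernels that actually request it.
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)


# Register limit 1, shared memory limit 2: the lower number wins.
print(blocks_per_sm(64, 1024, 32 * 1024))
# Register limit 4, shared memory limit 2: shared memory is now the bottleneck.
print(blocks_per_sm(16, 1024, 32 * 1024))
```

The same pattern extends to further limits by appending them to the list before taking the minimum.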
As for your second question, how many threads can run concurrently: as many as there are cores in one SM, so for an SMX of the Kepler architecture that is 192. The number of concurrent warps is then 192 / 32 = 6.
If you are interested in this stuff, I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their blocking factors and much more useful info.
EDIT:
Reading Robert Crovella's answer, I realized there really are such limits on blocks per SM and threads per SM, but I was never able to reach them because my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about the available CUDA devices. For the GK110 chip, for example, this info can also be found in the related document on the NVIDIA pages.