All of the threadblocks resident on a multiprocessor (SM) at the same time must share that SM's resources (registers, shared memory, etc.) between them.
If your threadblock uses shared memory, the first rule it must satisfy is that it cannot use more than what is available in the SM (i.e. 16KB in this case).
If the threadblock requires less than 16KB, then it may be possible to have multiple threadblocks executing on the SM at once. For example, two threadblocks could be executing if each uses no more than approximately 8KB. Four threadblocks could be executing if each uses slightly less than 4KB (slightly less, because there is usually some per-block overhead).
If you wanted the maximum of 8 threadblocks to be able to execute at once on a given SM (multiprocessor), then you would have to ensure in your code that the threadblock uses no more than 2KB of shared memory (probably a little less than 2KB).
If each threadblock used the full 16KB of shared memory, only one threadblock could be resident on that SM at a time. Additional threadblocks would simply wait in a queue until that threadblock finished on that SM before they began to execute.
If a threadblock attempted to use more than 16KB (in this case) you would get a kernel launch error.
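The arithmetic behind the cases above can be sketched in a few lines of host-side C++. This is only an illustration under the limits discussed here (16KB of shared memory per SM, at most 8 resident threadblocks per SM), and it assumes shared memory is the only limiting resource. The function name residentBlocks and the overheadBytes parameter are hypothetical, made up for this sketch; the real per-block overhead depends on the device and is not given here.

```cpp
#include <algorithm>
#include <cstddef>

// Limits taken from the discussion above; other devices have different values.
constexpr std::size_t kSharedMemPerSM = 16 * 1024; // 16KB of shared memory per SM
constexpr std::size_t kMaxBlocksPerSM = 8;         // max resident threadblocks per SM

// Hypothetical helper: how many threadblocks fit on one SM at once when
// shared memory is the only limit. overheadBytes is an assumed per-block
// overhead, not a documented figure. A result of 0 means a single block
// does not fit, which corresponds to the kernel launch error case.
constexpr std::size_t residentBlocks(std::size_t sharedBytesPerBlock,
                                     std::size_t overheadBytes = 0) {
    const std::size_t total = sharedBytesPerBlock + overheadBytes;
    if (total == 0) return kMaxBlocksPerSM;     // no shared memory used at all
    if (total > kSharedMemPerSM) return 0;      // one block alone is too large
    return std::min(kMaxBlocksPerSM, kSharedMemPerSM / total);
}

// The cases from the text, checked at compile time (requires C++14 or later).
static_assert(residentBlocks(2 * 1024) == 8, "2KB per block: all 8 blocks fit");
static_assert(residentBlocks(4 * 1024) == 4, "4KB per block: 4 blocks fit");
static_assert(residentBlocks(8 * 1024) == 2, "8KB per block: 2 blocks fit");
static_assert(residentBlocks(16 * 1024) == 1, "16KB per block: 1 block, the rest queue");
static_assert(residentBlocks(16 * 1024 + 1) == 0, "more than 16KB: launch would fail");
// With an assumed 16 bytes of overhead, exactly 4KB per block no longer fits 4 times.
static_assert(residentBlocks(4 * 1024, 16) == 3, "4KB plus overhead: only 3 blocks fit");
```

The last check shows why each threadblock has to use slightly less than 4KB (or slightly less than 2KB) to reach 4 (or 8) resident threadblocks once any overhead is counted. In practice registers and thread counts also limit how many threadblocks can be resident, so this is an upper bound, not a guarantee.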