You can run the deviceQuery example from the CUDA samples to determine the maximum number of blocks. Within each block you can have at most 1024 threads.
How many blocks can execute on an SM (streaming multiprocessor)? Each SM can have up to 16 active blocks on Kepler and 8 active blocks on Fermi.
You also need to think in terms of warps: one warp = 32 threads. On Fermi the maximum number of active warps per SM is 48, and on Kepler it is 64. These are ideal numbers. The actual number of warps executing on an SM depends on the launch configuration and on the resources (registers, shared memory) your kernel uses.
Occupancy is usually calculated as occupancy = active warps per SM / maximum active warps per SM.
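To make the arithmetic concrete, here is a rough host-side sketch of that calculation. It uses only the per-SM block and warp limits quoted above and deliberately ignores register and shared memory pressure, so it gives an upper bound rather than what the occupancy calculator would report. The names `SmLimits`, `warpsPerBlock`, `activeWarps` and `occupancy` are made up for this illustration and are not part of the CUDA API.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits quoted in the answer (illustrative struct, not a CUDA type).
struct SmLimits {
    int maxBlocks;  // maximum active blocks per SM
    int maxWarps;   // maximum active warps per SM
};

constexpr int kWarpSize = 32;
constexpr SmLimits kFermi{8, 48};
constexpr SmLimits kKepler{16, 64};

// A partially filled warp still occupies a whole warp slot, so round up.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Active warps per SM when only the block and warp limits apply.
// Assumption: registers and shared memory are NOT limiting factors here.
int activeWarps(const SmLimits& sm, int threadsPerBlock) {
    const int wpb = warpsPerBlock(threadsPerBlock);
    const int blocks = std::min(sm.maxBlocks, sm.maxWarps / wpb);
    return blocks * wpb;
}

// occupancy = active warps / maximum active warps
double occupancy(const SmLimits& sm, int threadsPerBlock) {
    return static_cast<double>(activeWarps(sm, threadsPerBlock)) / sm.maxWarps;
}
```

For example, with 64 threads per block on Fermi each block is 2 warps, the 8-block limit is hit first, and you get 16 active warps out of 48, i.e. about 33% occupancy. With 256 threads per block, both `occupancy(kFermi, 256)` and `occupancy(kKepler, 256)` come out at 1.0 under these assumptions.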
If N blocks are resident on an SM, they share that SM's shared memory, so each block can use at most 1/N of it. If you want a large number of resident blocks, check the occupancy calculator spreadsheet to see how much shared memory each block can use without affecting performance.
But,
__shared__ float BlockData[TILE_DIM][TILE_DIM];
is allocated per block, so each block gets its own full TILE_DIM x TILE_DIM array.
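Putting the two points together, the per-block size of that array decides how many blocks can be resident on an SM at once. A minimal host-side sketch of the arithmetic follows, assuming the SM is configured with 48 KB of shared memory (check deviceQuery for your device) and that the tile is the kernel's only shared allocation. `tileBytes` and `blocksLimitedBySharedMem` are hypothetical helper names for this illustration, not CUDA functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed shared memory per SM: 48 KB. This is an assumption for the example;
// the real value depends on the device and its shared memory configuration.
constexpr std::size_t kSharedPerSm = 48 * 1024;

// Bytes used per block by: __shared__ float BlockData[tileDim][tileDim];
constexpr std::size_t tileBytes(int tileDim) {
    return sizeof(float) * tileDim * tileDim;
}

// Number of blocks that can be resident on one SM, considering only the
// shared memory budget and the hardware cap on active blocks per SM.
int blocksLimitedBySharedMem(std::size_t sharedPerSm,
                             std::size_t sharedPerBlock,
                             int maxBlocksPerSm) {
    assert(sharedPerBlock > 0 && maxBlocksPerSm > 0);
    const std::size_t fit = sharedPerSm / sharedPerBlock;
    return static_cast<int>(
        std::min(fit, static_cast<std::size_t>(maxBlocksPerSm)));
}
```

With TILE_DIM = 32 the tile takes 4096 bytes per block, so at most 12 blocks fit in 48 KB. On Kepler (cap of 16 blocks) shared memory becomes the limit at 12 blocks, while on Fermi the 8-block cap is still reached first. With TILE_DIM = 16 the tile is only 1024 bytes, so shared memory is not the limiting factor on either architecture.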