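Yes, you can get parallel execution between multiple blocks. Warp schedulers belong to a multiprocessor, so blocks resident on the same multiprocessor can execute concurrently if that multiprocessor has more than one warp scheduler.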
CUDA devices with compute capability 2.1 have two warp schedulers per multiprocessor, so instructions from two different warps (from the same block or from different blocks, it doesn't matter) can be issued concurrently.
CUDA devices with compute capability 3.0 have four warp schedulers per multiprocessor, and each scheduler can issue two independent instructions for a warp that is ready to execute.
Note that even without concurrent execution between warps, it is advantageous to have multiple blocks available to the scheduler so that if a warp is blocked waiting for a memory operation to complete, the scheduler can switch to another warp for execution so the cores don't sit idle.
The number of warps that can be resident on a multiprocessor, ready for the scheduler to switch to, varies by compute capability.
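If you want to check the limit on your own hardware, the runtime API reports it. The following is a minimal sketch, assuming device 0 is the GPU you care about; the helper name maxResidentWarpsPerSM is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: maximum number of warps that can be resident on one
// multiprocessor of `device`, or -1 if the device cannot be queried.
int maxResidentWarpsPerSM(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.maxThreadsPerMultiProcessor / prop.warpSize;
}

int main() {
    const int device = 0;  // assumption: the first device is the one of interest
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }
    std::printf("compute capability:      %d.%d\n", prop.major, prop.minor);
    std::printf("multiprocessors:         %d\n", prop.multiProcessorCount);
    std::printf("max resident warps / SM: %d\n", maxResidentWarpsPerSM(device));
    return 0;
}
```

The reported value is an upper bound. Register and shared memory usage of your kernel can lower the number of warps that actually fit.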
If you only define as many blocks as you have schedulers, you will not be able to achieve the full compute potential of your device. This is particularly true if your code does a lot of memory access: one way to "hide" memory latency is to make sure there are enough blocks/warps available that the scheduler(s) always have a ready warp to switch to when one of the warps stalls waiting for a memory operation.
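One way to put this into practice is to size the grid from what the hardware can keep resident rather than from the scheduler count. This is only a sketch under stated assumptions: the kernel scale, the helpers saturatingGridSize and runScale, and the block size of 256 are all invented for illustration, and the occupancy call requires CUDA 6.5 or later.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Memory-bound example kernel: a grid-stride loop lets any grid size
// cover all n elements.
__global__ void scale(float *data, int n, float factor) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Hypothetical helper: number of blocks that fills every multiprocessor with
// as many blocks of `scale` as can be resident at once. Returns 0 on failure.
int saturatingGridSize(int blockSize) {
    int device = 0, numSMs = 0, blocksPerSM = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 0;
    if (cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount,
                               device) != cudaSuccess) return 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scale, blockSize, 0) != cudaSuccess) return 0;
    return numSMs * blocksPerSM;
}

// Doubles n ones on the device and checks that every element became 2.
bool runScale(int n) {
    const int blockSize = 256;  // assumption: a common starting point
    int grid = saturatingGridSize(blockSize);
    if (grid <= 0) return false;
    std::vector<float> host(n, 1.0f);
    size_t bytes = host.size() * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    scale<<<grid, blockSize>>>(dev, n, 2.0f);
    // This blocking copy also reports errors from the kernel launch.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;
    for (float v : host)
        if (v != 2.0f) return false;
    return true;
}

int main() {
    std::printf("grid size: %d blocks\n", saturatingGridSize(256));
    std::printf("result: %s\n", runScale(1 << 20) ? "ok" : "failed");
    return 0;
}
```

With many resident warps per multiprocessor, a warp stalled on a load from data leaves the scheduler other warps to issue from.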
Whenever you have multiple warps reading and writing the same memory address, you should use atomic operations or take a lock, regardless of whether your current hardware can execute multiple warps concurrently. Write-after-write artifacts ("lost updates") can manifest even in task-switched single-core execution.
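To illustrate the difference, here is a small sketch in which every thread increments one shared counter, once with a plain read-modify-write and once with atomicAdd. The kernel names and the runCount helper are hypothetical, and error checking is kept to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain read-modify-write: threads can read the same old value and overwrite
// each other's increments, so updates may be lost.
__global__ void countRacy(unsigned int *counter) { *counter += 1; }

// Atomic read-modify-write: every increment is applied exactly once.
__global__ void countAtomic(unsigned int *counter) { atomicAdd(counter, 1u); }

// Hypothetical helper: launches blocks * threads increments on a zeroed
// counter and returns the final value (0 if the allocation fails).
unsigned int runCount(bool useAtomic, int blocks, int threads) {
    unsigned int *dev = nullptr, host = 0;
    if (cudaMalloc(&dev, sizeof(unsigned int)) != cudaSuccess) return 0;
    cudaMemset(dev, 0, sizeof(unsigned int));
    if (useAtomic) countAtomic<<<blocks, threads>>>(dev);
    else           countRacy<<<blocks, threads>>>(dev);
    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(&host, dev, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main() {
    const int blocks = 64, threads = 256;
    std::printf("expected: %d\n", blocks * threads);
    std::printf("atomic:   %u\n", runCount(true, blocks, threads));
    std::printf("racy:     %u\n", runCount(false, blocks, threads));
    return 0;
}
```

The atomic version always reaches blocks * threads. The racy version has no guaranteed result and will typically come out lower.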