So does this mean that only 4 warps can be worked on simultaneously within a GPU kernel?
Instructions from up to 4 warps can be scheduled in any given clock cycle on a Kepler SMX. However, due to pipelining in the execution units, in any given clock cycle instructions may be in various stages of pipelined execution from any, and up to all, of the warps currently resident on the SMX.
And if so, does this mean I should really be looking for blocks whose sizes are multiples of 128 threads (assuming no thread divergence) as opposed to the recommended 32?
I'm not sure how you jumped from the previous point to this one. Since the instruction mix presumably varies from warp to warp (different warps are presumably at different points in the instruction stream), and the instruction mix varies from one place to another within the instruction stream, I don't see any logical connection between "4 warps schedulable in a given clock cycle" and any need to group warps in sets of 4. A given warp may be at a point where its instructions are highly schedulable (perhaps a sequence of SP FMAs, requiring SP cores, which are plentiful), while 3 other warps may be at another point in the instruction stream where their instructions are "harder to schedule" (perhaps requiring SFUs, of which there are fewer). Therefore arbitrarily grouping warps into sets of 4 doesn't make much sense.

Note that divergence is not required for warps to get out of sync with each other. The natural behavior of the scheduler, coupled with the varying availability of execution resources, can cause warps that started out together to end up at different points in the instruction stream.
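As a concrete illustration (a minimal sketch; the kernel and launch configurations are hypothetical, not from your code), any block size that is a multiple of the warp size keeps every warp fully populated, and nothing about the 4 schedulers favors multiples of 128:

__global__ void scale(float *data, float s, int n)
{
    // nothing in this kernel cares whether blockDim.x is 96, 128, or 256,
    // as long as it is a multiple of 32 so that no warp is partially filled
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= s;
}

// all of these launches are equally "warp-friendly":
// scale<<<(n +  95) /  96,  96>>>(d_data, 2.0f, n);  // 3 warps per block
// scale<<<(n + 191) / 192, 192>>>(d_data, 2.0f, n);  // 6 warps per block
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  // 8 warps per block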
For your second question, I think your fundamental knowledge gap is in understanding how a GPU hides latency. Suppose a GPU has the following sequence of 3 instructions to issue across a warp:
LD  R0, a[idx]     // load a[idx] from global memory into R0
LD  R1, b[idx]     // load b[idx] from global memory into R1
MPY R2, R0, R1     // multiply; depends on both R0 and R1
The first instruction is an LD from global memory; it can be issued and does not stall the warp. The second instruction can likewise be issued. The warp will stall at the 3rd instruction, however, due to the latency of global memory: until R0 and R1 are properly populated, the multiply instruction cannot be dispatched. The GPU deals with this problem by (hopefully) having a ready supply of "other work" it can turn to, namely other warps in an unstalled state (i.e. warps that have an instruction that can be issued). The best way to facilitate this latency-hiding process is to have many warps resident on the SMX. There isn't any granularity requirement here (such as needing groups of 4 warps). Generally speaking, the more threads/warps/blocks there are in your grid, the better chance the GPU has of hiding latency.
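At the CUDA source level, that instruction sequence would come from something like the following (a minimal sketch; the kernel name and parameters are mine, not from your code):

__global__ void mul(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = a[idx] * b[idx];  // two global loads (the LDs), then a multiply
                                   // (the MPY) that must wait for both values
}

Every warp running this kernel hits the same stall at the multiply, so the more warps the SMX has resident, the more likely the scheduler is to find one whose loads have already completed.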
So it is true that the GPU cannot "launch" 2048 threads (i.e. issue instructions from 2048 threads) in a single clock cycle. But when a warp stalls, it is put into a waiting queue until the stall condition is lifted; until then, it is helpful to have other warps "ready to go" for the next clock cycle(s).
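If you want an actual number for how many warps can be resident (and therefore potentially "ready to go") on each SMX for a particular kernel, the runtime can tell you. A minimal sketch (the mul kernel is repeated from the example above; cudaOccupancyMaxActiveBlocksPerMultiprocessor is a CUDA runtime API call, available in CUDA 6.5 and later):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mul(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] * b[idx];
}

int main()
{
    int blockSize = 256;  // 8 warps per block
    int numBlocks = 0;
    // ask the runtime how many blocks of mul can be resident per multiprocessor
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, mul, blockSize, 0);
    printf("resident blocks/SM: %d -> resident warps/SM: %d\n",
           numBlocks, numBlocks * blockSize / 32);
    return 0;
}

The closer that warp count is to the hardware maximum (2048 threads, i.e. 64 warps, per SMX on Kepler), the more candidates the scheduler has when other warps stall.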
GPU latency hiding is a commonly misunderstood topic. There are many available resources to learn about it if you search for them.