A warp is a group of 32 threads that the hardware schedules and executes together: the multiprocessor issues an instruction for the whole warp, so all 32 threads run it at the same time.
I haven't found anything stating that you can control which warp executes next. All you know is that a warp consists of 32 threads, so the size of a thread block should be a multiple of 32; otherwise the last warp of the block is only partly filled.
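To make the "multiple of 32" advice concrete, here is a minimal sketch of the arithmetic. The helper names `warps_per_block` and `idle_lanes` are my own, not part of any CUDA API; the only fact assumed is the warp size of 32 mentioned above.

```python
WARP_SIZE = 32  # warp width discussed in this answer


def warps_per_block(threads_per_block: int) -> int:
    """Warps needed to hold one block (ceiling division by the warp size)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def idle_lanes(threads_per_block: int) -> int:
    """Slots in the block's last warp that have no thread assigned to them."""
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block


# Hypothetical block sizes, chosen only for illustration.
for size in (64, 100, 256):
    print(size, warps_per_block(size), idle_lanes(size))
```

A block of 100 threads still occupies 4 warps, but 28 slots in the last warp do no useful work, which is why block sizes such as 64, 128 or 256 are the usual choice.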
Threads in a single block are executed on a single multiprocessor, where they share the software-managed data cache (shared memory) and can synchronize and exchange data with one another. A warp is always a subset of the threads of a single block.
There is also this, with regard to memory operations and latency:
When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. A Tesla supports up to 32 active warps on each multiprocessor, and a Fermi supports up to 48. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.
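As a rough illustration of what those warp limits mean for a launch configuration, the sketch below computes how many blocks could be resident on one multiprocessor if the active-warp limit were the only constraint. That is a simplifying assumption: real hardware also limits resident blocks by registers, shared memory and a maximum block count, none of which is modelled here. The function name `resident_blocks` is hypothetical, and the limits are the ones given in the quote.

```python
WARP_SIZE = 32

# Active-warp limits per multiprocessor, as stated in the quoted passage.
MAX_ACTIVE_WARPS = {"Tesla": 32, "Fermi": 48}


def resident_blocks(threads_per_block: int, arch: str) -> int:
    """Upper bound on resident blocks per multiprocessor, warp limit only.

    Ignores register, shared-memory and max-block limits (assumption).
    """
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return MAX_ACTIVE_WARPS[arch] // warps


# Hypothetical block size of 256 threads, i.e. 8 warps per block.
for arch in ("Tesla", "Fermi"):
    print(arch, resident_blocks(256, arch))
```

The more warps that are resident, the more candidates the multiprocessor has to switch to when one warp stalls on memory, which is the latency-hiding mechanism the quote describes.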
With regard to dividing thread blocks into warps, I have found this:
if the block is 2D or 3D, the threads are ordered by first dimension, then second, then third – then split into warps of 32
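The ordering in that quote can be sketched as plain arithmetic. The function names below are my own; the parameters `x, y, z` stand in for `threadIdx.x/y/z` and `dim_x, dim_y` for `blockDim.x/y`. This is a host-side model of the numbering, not device code.

```python
WARP_SIZE = 32


def linear_thread_id(x: int, y: int, z: int, dim_x: int, dim_y: int) -> int:
    """Flatten a 3D thread index: first dimension varies fastest, then second, then third."""
    return x + y * dim_x + z * dim_x * dim_y


def warp_and_lane(x: int, y: int, z: int, dim_x: int, dim_y: int) -> tuple:
    """Return (warp index within the block, position within that warp)."""
    tid = linear_thread_id(x, y, z, dim_x, dim_y)
    return tid // WARP_SIZE, tid % WARP_SIZE


# Hypothetical 16x16 block: each warp covers two consecutive rows of 16 threads.
print(warp_and_lane(15, 1, 0, 16, 16))  # last thread of the first warp
print(warp_and_lane(0, 2, 0, 16, 16))   # first thread of the second warp
```

So in a 16x16 block, rows 0 and 1 form the first warp, rows 2 and 3 the second, and so on.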