From CUDA Programming Guide 4.2:
[...] at every instruction issue time, a warp scheduler selects a warp
that has threads ready to execute its next instruction (the active
threads of the warp) and issues the instruction to those threads.
So, the maximum number of warps per SM that can be issued instructions concurrently is equal to the number of warp schedulers.
The GeForce GTX 580 is a compute capability 2.0 (Fermi) device:
For devices of compute capability 2.x, a multiprocessor consists of: [...] 2 warp schedulers
This means each SM of your GPU can issue instructions to 2 warps = 64 threads per cycle, which is 1024 threads across the 16 SMs of a GTX 580. Please note, however, that it's highly recommended to use far more threads than that:
The number of clock cycles it takes for a warp to be ready to execute
its next instruction is called the latency, and full utilization is
achieved when all warp schedulers always have some instruction to
issue for some warp at every clock cycle during that latency period,
or in other words, when latency is completely “hidden”.
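If you want to check these figures for your own card, the runtime API reports them via cudaGetDeviceProperties. Here's a minimal sketch (the field names are real runtime API fields; the numbers it prints of course depend on your device):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device               : %s (compute %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Warp size            : %d threads\n", prop.warpSize);
    printf("Max threads per SM   : %d (= %d resident warps)\n",
           prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```

On a compute capability 2.x device this reports 1536 resident threads (48 warps) per SM, far more than the 2 warps the schedulers can issue to in any given cycle; those extra resident warps are exactly what hides the latency described in the quote above.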
Regarding your other questions: the GeForce GTX 690 has 3072 CUDA Cores. However, to CUDA it appears as two separate GPUs with 1536 cores each, so it's not better than two GeForce GTX 680 cards, and the latter is easily overclocked judging by numerous online reviews. The largest memory among GPUs is installed in the nVidia Tesla M2090: 6 GiB of GDDR5 (512 CUDA Cores). I expect a new family of Teslas based on the Kepler architecture (like the GeForce 6xx series) to be released soon, but I haven't heard of any official announcements.
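Coming back to the dual-GPU point: the runtime simply enumerates a GTX 690 as two devices, and you target each half explicitly with cudaSetDevice. A rough sketch of what that looks like (device ordering and names vary from system to system):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);          // a GTX 690 reports 2 devices here

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, %d SMs, %.0f MiB of global memory\n",
               dev, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0));
    }

    // Allocations and kernel launches intended for the second half of the
    // board must be issued after selecting that device explicitly:
    if (count > 1) {
        cudaSetDevice(1);
        // ... cudaMalloc / kernel launches for device 1 go here ...
    }
    return 0;
}
```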