CUDA Block parallelism

Question

I am writing some code in CUDA and am a little confused about what is actually run in parallel.

Say I am calling a kernel function like this: kernel_foo<<<A, B>>>. Now, as per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed that I will have 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, so does that mean I can have only 96 threads running concurrently at a time? (See device_query below.)
I wanted to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel, and how? I mean, is it done one block after the other, or are blocks parallelized too? If yes, then how many blocks can run 512 threads each in parallel? It says here that one block is run on one CUDA SM, so is it true that 12 blocks can run concurrently? If yes, then how many threads can each block have running concurrently (8, 96 or 512) when all 12 blocks are also running concurrently? (See device_query below.)
Another question: if A had a value of ~50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume there is no thread synchronization required.

Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?

Thanks

Here's my device_query:

Device 0: "Quadro FX 4600"
CUDA Driver Version / Runtime Version          4.2 / 4.2
CUDA Capability Major/Minor version number:    1.0
Total amount of global memory:                 768 MBytes (804978688 bytes)
(12) Multiprocessors x (  8) CUDA Cores/MP:    96 CUDA Cores
GPU Clock rate:                                1200 MHz (1.20 GHz)
Memory Clock rate:                             700 Mhz
Memory Bus Width:                              384-bit
Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       16384 bytes
Total number of registers available per block: 8192
Warp size:                                     32
Maximum number of threads per multiprocessor:  768
Maximum number of threads per block:           512
Maximum sizes of each dimension of a block:    512 x 512 x 64
Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             256 bytes
Concurrent copy and execution:                 No with 0 copy engine(s)
Run time limit on kernels:                     Yes
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       No
Concurrent kernel execution:                   No
Alignment requirement for Surfaces:            Yes
Device has ECC support enabled:                No
Device is using TCC driver mode:               No
Device supports Unified Addressing (UVA):      No
Device PCI Bus ID / PCI location ID:           2 / 0

Tom Tom · Accepted Answer · 2013-02-12T12:45:49

Check out this answer for some first pointers! The answer is a little out of date in that it talks about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks, and of oversubscribing the hardware to hide latencies, the changes are easy to pick up.

Also, you could take this Udacity course or this Coursera course to get going.
