Clarification about the number of CUDA threads executed per SM

Question

I am new to CUDA programming and am reading about the G80 chip, which has 128 SPs (16 SMs, each with 8 SPs), in the book "Programming Massively Parallel Processors - A hands on approach". The book compares Intel CPUs with the G80 chip: Intel CPUs support 2 to 4 threads per core, depending on the machine model, whereas the G80 chip supports 768 threads per SM, which adds up to about 12,000 threads for the chip (16 × 768 = 12,288).

My question is: can the G80 chip execute 768 threads per SM simultaneously? If not, what does it mean that Intel CPUs support 2 to 4 threads per core? We can always have many threads/processes running on an Intel CPU, scheduled by the OS.

lashgar lashgar · Accepted Answer · 2012-09-17T14:21:03

The G80 keeps the context of 768 threads per SM resident at the same time and interleaves their execution; it does not execute all 768 in the same clock cycle. This is the key difference between a CPU and a GPU. GPUs are deeply multithreaded processors that hide the memory-access latency of some threads behind the computation of other threads. The latency of executing a single thread is much higher than on a CPU, because the GPU is optimized for thread throughput rather than thread latency. In comparison, CPUs use out-of-order, speculative execution to reduce the execution delay of one thread.

GPUs use several techniques to reduce thread-scheduling overhead. For example, they group threads into coarser schedulable units called warps (or wavefronts) and execute the threads of a warp over SIMD hardware. GPU threads all run the same code, which makes them a good fit for the SIMD model. From the programmer's point of view, threads execute in MIMD fashion, and they are grouped into thread blocks to reduce communication overhead.
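To make the "resident versus executing" distinction concrete, here is a small arithmetic sketch using the G80 figures from the question. The warp size of 32 is the usual NVIDIA value and is not stated in the question, and the "passes" figure assumes the simplified model that each SP handles one thread of a warp per pass.

```python
# G80 figures quoted in the question.
SMS = 16                           # streaming multiprocessors on the chip
SPS_PER_SM = 8                     # streaming processors (lanes) per SM
MAX_RESIDENT_THREADS_PER_SM = 768  # thread contexts an SM can hold

# Assumption: warp size of 32 threads (standard on NVIDIA GPUs).
WARP_SIZE = 32

# Threads whose context is kept on chip at the same time.
resident_threads_chip = SMS * MAX_RESIDENT_THREADS_PER_SM

# Warps the per-SM scheduler can choose from.
resident_warps_per_sm = MAX_RESIDENT_THREADS_PER_SM // WARP_SIZE

# Threads that can be processed in the same pass: one per SP.
lanes_chip = SMS * SPS_PER_SM

# Simplified model: a 32-thread warp instruction needs several passes
# over the 8 SPs of an SM.
passes_per_warp_instruction = WARP_SIZE // SPS_PER_SM

print("resident threads on chip:", resident_threads_chip)
print("resident warps per SM:", resident_warps_per_sm)
print("lanes on chip:", lanes_chip)
print("passes per warp instruction:", passes_per_warp_instruction)
```

So roughly 12,000 threads are resident, but only 128 of them are being processed in any given pass; the rest are waiting to be interleaved.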

The threads in a CPU core are used to fill its different execution units through dynamic scheduling. CPU threads are not necessarily of the same type: while one thread is busy in the floating-point unit, other threads may find the ALU idle, so these threads can execute concurrently. Multiple threads per core are maintained to keep the different execution units busy and prevent idle units. However, dynamic scheduling is costly in terms of power and energy consumption, so manufacturers support only a few threads per CPU core.

In answer to the second part of your question: threads on a GPU are scheduled by hardware (the per-SM warp scheduler), and neither the OS nor the driver affects the scheduling.
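As a rough illustration of why a hardware scheduler with many resident warps hides latency, here is a toy simulation. It is not a model of the real G80 scheduler: the function `cycles_to_finish`, the one-issue-per-cycle rule, and the 23-cycle latency are all assumptions chosen to keep the numbers small.

```python
def cycles_to_finish(num_warps, instrs_per_warp=4, mem_latency=23):
    """Toy model of one SM's warp scheduler (hypothetical, for illustration).

    Each cycle the scheduler issues one instruction from the first warp
    that is ready. After issuing, that warp stalls for mem_latency cycles
    before its next instruction can issue.
    Returns (total_cycles, fraction_of_cycles_that_issued_an_instruction).
    """
    ready_at = [0] * num_warps                 # cycle when each warp may issue next
    remaining = [instrs_per_warp] * num_warps  # instructions left per warp
    cycle = 0
    busy = 0                                   # cycles in which something issued
    while any(remaining):
        for w in range(num_warps):
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + 1 + mem_latency
                busy += 1
                break                          # at most one issue per cycle
        cycle += 1
    return cycle, busy / cycle


one_warp = cycles_to_finish(1)     # a single warp: mostly waiting
many_warps = cycles_to_finish(24)  # 24 resident warps: stalls overlap
print("1 warp:  ", one_warp)
print("24 warps:", many_warps)
```

In this toy setup, a single warp leaves the issue slot idle most of the time, while 24 warps finish 24 times the work in only slightly more cycles because another warp is always ready while the others wait.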
