11 votes

Let's take the nVidia Fermi Compute Architecture. It says:

The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each.

[...]

Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU).

[...]

In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations.

From what I know, GPUs execute threads in so-called warps, each warp consisting of ~32 threads. What is unclear to me: each warp is assigned to only one core (is that true?). Does that mean that each of the 32 cores of a single SM is a SIMD processor, where a single instruction handles 32 portions of data? If so, then why do we say there are 32 threads in a warp, rather than a single SIMD thread? Why are cores sometimes referred to as scalar processors rather than vector processors?
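For concreteness, here is the kind of minimal kernel I have in mind (the kernel name and the warp/lane arithmetic are mine, just for illustration); each thread simply reports which warp and lane it belongs to:

```cuda
// Minimal sketch (hypothetical kernel): every thread computes its warp and
// lane index. All 32 threads of a warp execute this same instruction stream.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void show_warp_layout()
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
    int lane = threadIdx.x % 32;                      // position within the warp
    int warp = threadIdx.x / 32;                      // warp index within the block
    if (lane == 0)                                    // one printout per warp
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp, tid);
}

int main()
{
    show_warp_layout<<<2, 64>>>();  // 2 blocks of 64 threads = 2 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```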

2
to whoever voted to close this question: what is unclear in a question asking whether a GPU core has a SIMD architecture? – Marc Andreson

2 Answers

18 votes

Each warp is assigned to only one core (is that true?).

No, it's not true. A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).

Cores are in fact scalar processors, not vector processors. 32 cores (or execution units) are marshalled by the warp scheduler to execute a single instruction, across 32 threads, which is where the "SIMT" moniker comes from.
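As a rough illustration (a sketch with made-up names, not anything from the question), consider an ordinary element-wise kernel. Each thread of a warp runs the same instruction stream on a different element, so the warp scheduler issues one add instruction and 32 execution units carry it out, one per lane:

```cuda
// SIMT sketch: one instruction (the add below) is issued per warp and executed
// by 32 scalar execution units, one per thread/lane. Names are illustrative.
#include <vector>
#include <cuda_runtime.h>

__global__ void add_elements(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // same add instruction, different data per lane
}

int main()
{
    const int n = 1 << 10;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    add_elements<<<(n + 127) / 128, 128>>>(da, db, dc, n); // 128 threads = 4 warps per block
    cudaMemcpy(hc.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Note that the grouping into warps is done by the hardware, not the programmer: a block of 128 consecutive threads here is simply 4 warps of 32.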

9 votes

CUDA "cores" can be thought of as SIMD lanes.

First let's recall that the term "CUDA core" is nVIDIA marketing-speak. These are not cores the same way a CPU has cores. Similarly, "CUDA threads" are not the same as the threads we know on CPUs.

The equivalent of a CPU core on a GPU is a "streaming multiprocessor": it has its own instruction scheduler/dispatcher, its own L1 cache, its own shared memory, etc. It is CUDA thread blocks rather than warps that are assigned to a GPU core, i.e. to a streaming multiprocessor. Within an SM, warps get selected to have instructions scheduled, for the entire warp. From a CUDA perspective, those are 32 separate threads which are instruction-locked; but that's really no different from saying that a warp is like a single thread which only executes 32-lane-wide SIMD instructions. Of course this isn't a perfect analogy, but I feel it's pretty sound. Something you don't quite / don't always have with CPU SIMD lanes is masking of which lanes are actively executing: inactive lanes are masked off, so unlike active lanes they do not set register values, perform memory writes, etc.
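To illustrate the masking, here is a minimal sketch (the kernel and variable names are hypothetical): when a warp hits the branch, lanes 0–15 are active for the first store and lanes 16–31 for the second; the masked-off lanes simply do not perform the write, and the warp reconverges afterwards.

```cuda
// Sketch of per-lane masking under divergence (names are illustrative).
// Each side of the branch is executed with only the matching lanes active;
// masked-off lanes perform no register or memory writes.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent_write(int* out)
{
    int lane = threadIdx.x % 32;
    if (lane < 16)
        out[threadIdx.x] = 1;   // only lanes 0..15 are active for this store
    else
        out[threadIdx.x] = 2;   // only lanes 16..31 are active here
    // After both paths complete, the warp reconverges and all 32 lanes proceed.
}

int main()
{
    int* d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    divergent_write<<<1, 32>>>(d_out);   // launch exactly one warp
    int h_out[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i)
        printf("%d ", h_out[i]);         // prints 1 sixteen times, then 2 sixteen times
    printf("\n");
    cudaFree(d_out);
    return 0;
}
```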

I hope this makes intuitive sense to you (or perhaps you've figured this out yourself over the past 2 years).