1 vote

I'm reading through the answers and there are conflicting ideas. In this link, https://www.3dgep.com/cuda-thread-execution-model/, two warps (64 threads) can run concurrently on an SM (32 CUDA cores). So, I understand that the threads in a warp are split and processed on 16 CUDA cores. This idea makes sense to me because each CUDA core has one 32-bit ALU.

However, other links claim that one CUDA core is able to handle 32 concurrent threads, the same as a warp size (https://cvw.cac.cornell.edu/GPU/simt_warp). So, one CUDA warp would be processed by a single CUDA core only. This also makes sense because all threads in the same warp share the same program counter.

So, my question is: how is a CUDA warp mapped to CUDA cores?

There is no one answer to this question. It depends on the hardware and has evolved over time. The Cornell link is plainly wrong; there has never been a GPU that works the way that text describes. – talonmies

1 Answer

7 votes

Inside a CUDA GPU, there are computing units called SMs (Streaming Multiprocessors). Each SM has a variety of hardware resources (warp schedulers, instruction fetch/decode units, a register file, execution/functional units, shared memory, L1 cache, etc.) that are used to support CUDA threads of execution.

Whenever an instruction is issued, it is issued warp-wide. Therefore, any instruction issued will require 32 functional units of the type appropriate for that instruction. CUDA low-level instructions (SASS) can be broken into a number of categories, and for each category there is a functional-unit type that handles instructions in that category. For example, a load-from-memory instruction (e.g. LD) will be handled by a LD/ST (load/store) unit. There are a number of different kinds of these instruction processing units.

Some additional particular kinds of units are SP and DP units. An SP unit can handle a single-precision floating point multiply, add, or multiply-add instruction. A DP unit is similar except that it handles instructions working on double-precision floating point types.

To issue an instruction, therefore, a warp-scheduler will ultimately need 32 of the type of unit appropriate for that instruction type. For a single-precision floating point multiply operation, it will require 32 SP units to be available, in that cycle, to receive that issued instruction.

Other types of instructions will still require 32 units (eventually), but there may not be 32 of a given type of unit in the SM. When there are fewer than 32 of a particular type of unit, the warp scheduler will issue a single instruction across multiple clock cycles. Suppose, for example, that a particular GPU SM design has only 4 DP units. Then the warp scheduler, when it has e.g. a DP multiply instruction to issue, will use those 4 units over a total of 8 clock cycles (4 × 8 = 32) so as to provide a functional/execution unit for each thread in the warp.

A functional unit is ultimately needed per thread, and each functional unit can handle one instruction, for one thread, per clock. To handle an instruction issued warp-wide, either 32 functional units are available and the instruction can be issued in a single clock cycle, or the instruction is issued over multiple clock cycles to a smaller number of functional units.
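The arithmetic above can be sketched as a tiny calculation (a minimal illustration of the issue-cycle math, not a model of any specific GPU; the unit counts passed in are hypothetical):

```python
import math

WARP_SIZE = 32  # an instruction is issued warp-wide, i.e. for 32 threads


def issue_cycles(units_available: int) -> int:
    """Clock cycles needed to issue one warp-wide instruction to a
    functional-unit type with the given number of units in the SM."""
    return math.ceil(WARP_SIZE / units_available)


# 32 matching units: the whole warp issues in a single cycle
print(issue_cycles(32))  # -> 1

# Only 4 DP units: the scheduler reuses them over 8 cycles (4 x 8 = 32)
print(issue_cycles(4))   # -> 8
```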

The term "core" in CUDA is generally used to refer to an SP unit as defined above. Given this, we can immediately determine that:

  1. A CUDA "core" is really not like a CPU core.
  2. A CUDA "core" will only be involved in instruction processing for a relatively small number of instruction types, including SP floating-point add, multiply, and multiply-add. Any other instruction type will require a different kind of functional unit to handle it. And just because an SM contains, for example, 128 CUDA cores (i.e. SP units) does not mean that it also contains 128 DP units, or 128 LD/ST units, or a particular number of any other functional-unit type. The number of functional units in an SM can and does vary by functional-unit type. Different GPU architectures (Maxwell, Pascal, Volta) and different compute capabilities within an architecture may have different mixes or quantities of these functional-unit types.
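To make point 2 concrete, here is a short sketch using a purely hypothetical per-SM unit mix (the counts below are illustrative inventions, not the specification of any real GPU), showing how the cycles needed to issue one warp-wide instruction depend on which functional-unit type the instruction targets:

```python
import math

WARP_SIZE = 32

# Hypothetical per-SM functional-unit counts -- illustrative only;
# real counts vary by architecture and compute capability.
sm_units = {
    "SP (CUDA core)": 128,
    "DP": 4,
    "LD/ST": 32,
}

for unit_type, count in sm_units.items():
    cycles = math.ceil(WARP_SIZE / count)
    print(f"{unit_type}: {count} units -> "
          f"{cycles} cycle(s) to issue one warp-wide instruction")
```

The point of the sketch is only that each instruction category is limited by the count of its own unit type; a large number of CUDA cores says nothing about DP or LD/ST throughput.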