The actual throughput achieved by a kernel is reported by the CUDA profiler using four metrics:
- Global memory load throughput
- Global memory store throughput
- DRAM read throughput
- DRAM write throughput
The CUDA C Best Practices Guide describes global memory load/store throughput as the actual throughput, but says nothing specific about DRAM read/write throughput.
The CUPTI Users Guide defines:
- Global memory load throughput as ((128*global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_cached_local_ld_misses * 128))/(gputime)
- Global memory store throughput as ((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_cached_local_ld_misses * 128))/(gputime)
- DRAM read throughput as (fb_subp0_read + fb_subp1_read) * 32 / gputime
- DRAM write throughput as (fb_subp0_write + fb_subp1_write) * 32 / gputime
I understand the DRAM read/write throughput, since the fb_subp* counters report the number of DRAM accesses (incremented by 1 for each 32-byte access) and are collected across all SMs. So it is clear to me that this throughput is calculated as a function of gputime and the number of bytes accessed.
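To make that concrete, here is a minimal sketch of the conversion I have in mind (the counter values are made up, and I am assuming gputime is in microseconds and that GB/s means 10^9 bytes per second):

```python
def dram_throughput_gb_s(fb_subp0, fb_subp1, gputime_us):
    """Apply the CUPTI formula: each fb_subp* increment is one
    32-byte DRAM access; gputime is assumed to be in microseconds."""
    bytes_accessed = (fb_subp0 + fb_subp1) * 32
    seconds = gputime_us * 1e-6
    return bytes_accessed / seconds / 1e9  # GB/s

# Hypothetical counter values for illustration:
read_gbps = dram_throughput_gb_s(fb_subp0=1_000_000,
                                 fb_subp1=1_000_000,
                                 gputime_us=500.0)
print(f"DRAM read throughput: {read_gbps:.1f} GB/s")  # 128.0 GB/s
```

The write throughput would be computed the same way from fb_subp0_write and fb_subp1_write.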
I do not understand the global memory throughput definitions. There is no definition of the global_load_hit counter, and I do not see why l1_cached_local_ld_misses is subtracted in both formulas.
Is DRAM something different from global memory in this context?
If I want to know the actual throughput of my kernel, should I use the DRAM throughput metrics or the global memory throughput metrics?