I am trying to understand the architecture of GPUs and how we assess the performance of our programs on the GPU. I know that an application can be:
- Compute-bound: performance is limited by the FLOPS rate. The processor's cores are fully utilized (they always have work to do).
- Memory-bound: performance is limited by the memory bandwidth. The processor's cores are frequently idle because memory cannot supply data fast enough.
The image below shows the FLOPS rate, peak memory bandwidth, and the desired compute-to-memory ratio, labeled (OP/B), for each microarchitecture.
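(For reference, that desired ratio is just the peak FLOPS rate divided by the peak memory bandwidth. As an assumed example, since the image is not reproduced here: a GPU with roughly 19.5 TFLOP/s of FP32 compute and 1555 GB/s of memory bandwidth, approximately an A100, has a desired ratio of about 19.5×10^12 OP/s ÷ 1.555×10^12 B/s ≈ 12.5 OP/B. A kernel needs at least that many operations per byte loaded to be compute-bound on that chip.)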
I also have an example of how to compute this OP/B metric. Below is part of a CUDA kernel that performs matrix-matrix multiplication:
    for (unsigned int i = 0; i < N; ++i) {
        // per iteration: 1 FP mul + 1 FP add, with 2 FP loads (one from A, one from B)
        sum += A[row*N + i] * B[i*N + col];
    }
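For context, here is a minimal sketch of the naive kernel this loop typically sits in (the kernel name, signature, and bounds check are my assumptions, not part of the original code):

    __global__ void matmulNaive(const float* A, const float* B, float* C, unsigned int N) {
        unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
        unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (unsigned int i = 0; i < N; ++i)
                sum += A[row*N + i] * B[i*N + col];
            C[row*N + col] = sum;  // one store per thread; ignored in the OP/B count
        }
    }

Each thread streams one full row of A and one full column of B from global memory with no reuse, which is what the following count captures.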
The way to calculate OP/B for this matrix-matrix multiplication is as follows:
- Matrix multiplication performs 0.25 OP/B:
  - 1 FP add and 1 FP mul for every 2 FP values (8 B) loaded
  - Stores are ignored
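Spelled out: each loop iteration performs 2 FP operations (1 mul + 1 add) while loading 2 values × 4 B = 8 B, so OP/B = 2 / 8 = 0.25.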
And if we want to improve on this:
- Matrix multiplication has a high potential for data reuse. For N×N matrices:
  - Data loaded: (2 input matrices) × (N^2 values) × (4 B) = 8N^2 B
  - Operations: (N^2 dot products) × (N adds + N muls each) = 2N^3 OP
  - Potential compute-to-memory ratio: 0.25N OP/B (see the tiling sketch after this list)
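The standard way to capture some of that reuse is shared-memory tiling: each thread block stages TILE×TILE sub-tiles of A and B in on-chip shared memory, so that every value loaded from global memory is reused by TILE threads. A minimal sketch, where TILE = 16 and the assumption that N is a multiple of TILE are mine:

    // TILE is an assumed tile width, chosen for illustration.
    #define TILE 16

    __global__ void matmulTiled(const float* A, const float* B, float* C, unsigned int N) {
        // Sub-tiles of A and B staged in on-chip shared memory.
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        unsigned int row = blockIdx.y * TILE + threadIdx.y;
        unsigned int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        // Assumes N is a multiple of TILE (no boundary handling in this sketch).
        for (unsigned int t = 0; t < N / TILE; ++t) {
            // Each thread loads one element per tile; every loaded value is
            // then read by TILE different threads of the block.
            As[threadIdx.y][threadIdx.x] = A[row*N + t*TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t*TILE + threadIdx.y)*N + col];
            __syncthreads();

            for (unsigned int i = 0; i < TILE; ++i)
                sum += As[threadIdx.y][i] * Bs[i][threadIdx.x];
            __syncthreads();
        }
        C[row*N + col] = sum;
    }

Launched with dim3 block(TILE, TILE) and dim3 grid(N/TILE, N/TILE), each value loaded from global memory is reused TILE times, so the ratio improves from 0.25 OP/B to roughly 0.25 × TILE OP/B (4 OP/B for TILE = 16), still short of the full 0.25N potential.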
So, if I understand this correctly, I have the following questions:
- Is it always the case that the greater the OP/B, the better?
- How do we know how many FP operations we have? Is it just the adds and the multiplications?
- How do we know how many bytes are loaded per FP operation?