The basic problem is that you have not "saturated" the GPU with work. There are various overheads associated with a kernel launch. If the time the kernel spends computing is small compared to these overheads, your measurement will be skewed by them:
T = overhead time (OT) + calculation time (CT)
Flops/s = Flops/T = Flops/(OT + CT)
If the calculation time is small compared to the overhead time (which is the case for your kernels), the measured Flops/s is dominated by the overhead. If, on the other hand, the calculation time is large compared to the overhead, the overhead has relatively little effect on the result.
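To put numbers on this using the first run shown below (all figures come from the output further down):
Flops = 16384 threads x 8 FFMA x 2 flops/FFMA = 262,144
Measured: 262,144 flops / ~18.2us = ~14.4 GFlops/s
Kernel execution only: 262,144 flops / ~1.6us = ~160 GFlops/s
so roughly 90% of the measured time is overhead rather than calculation.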
Here's a full test case, with a few configurations run, on CUDA 9.1 and a Tesla P100 PCIE:
$ cat t79.cu
#ifndef SLEN
#define SLEN (8)
#endif
#ifndef NTPB
#define NTPB (256)
#endif
#ifndef BLKS
#define BLKS (16384/NTPB)
#endif
const size_t blks = BLKS;
const size_t ntpb = NTPB;
typedef float Ftype;
#include <iostream>
template <int LEN>
__global__ void WGSXMAPIXLLXOPS (Ftype *GOut, const Ftype M) {
int gid = blockIdx.x * blockDim.x + threadIdx.x;
Ftype MF = (Ftype) M;
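// each of the following lines is one fused multiply-add (FFMA), counted as 2 flops;
// the template parameter LEN selects how many dependent FFMAs each thread executes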
for (int lcdd = 0; lcdd < 1; lcdd++) {
float temp1 = 1.0;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
if (LEN > 8){
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;}
if (LEN > 16){
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;}
if (LEN > 32){
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;}
if (LEN > 64){
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;}
if (LEN > 128){
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;}
if (LEN > 256){
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;}
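// with NO_WRITE defined, the store is guarded by a data-dependent test (never true for
// these inputs), which keeps the FFMA chain alive without writing to global memory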
#ifdef NO_WRITE
if (temp1 == -10.0)
#endif
GOut[gid] = temp1;
}
}
int main(){
float et;
Ftype *GOut;
const Ftype M = 1.0;
cudaMalloc(&GOut, blks*ntpb*sizeof(Ftype));
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
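// warm-up launch: absorbs one-time initialization overhead; only the second launch is timed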
WGSXMAPIXLLXOPS<SLEN><<<blks, ntpb>>> (GOut, M);
cudaDeviceSynchronize();
cudaEventRecord(start);
WGSXMAPIXLLXOPS<SLEN><<<blks, ntpb>>> (GOut, M);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&et, start, stop);
unsigned long long flpcnt = SLEN*2*blks*ntpb;
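// et is in milliseconds, so flpcnt/et is flops per millisecond, i.e. Kflops/s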
float Kflops_s = flpcnt/et;
std::cout << "MFlops per sec: " << Kflops_s/1000 << " kernel time: " << et << "ms" << std::endl;
cudaDeviceSynchronize();
}
$ nvcc -arch=sm_60 -o t79 t79.cu
$ ./t79
MFlops per sec: 14371.9 kernel time: 0.01824ms
$ nvprof ./t79
==14676== NVPROF is profiling process 14676, command: ./t79
MFlops per sec: 10101.1 kernel time: 0.025952ms
==14676== Profiling application: ./t79
==14676== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 3.2320us 2 1.6160us 1.2480us 1.9840us void WGSXMAPIXLLXOPS<int=8>(float*, float)
API calls: 98.31% 389.62ms 1 389.62ms 389.62ms 389.62ms cudaMalloc
1.10% 4.3574ms 376 11.588us 357ns 465.31us cuDeviceGetAttribute
0.42% 1.6829ms 4 420.73us 272.19us 642.45us cuDeviceTotalMem
0.12% 487.27us 4 121.82us 90.094us 164.09us cuDeviceGetName
0.02% 80.363us 2 40.181us 15.789us 64.574us cudaLaunch
0.00% 17.118us 2 8.5590us 8.1400us 8.9780us cudaDeviceSynchronize
0.00% 13.118us 2 6.5590us 5.4290us 7.6890us cudaEventRecord
0.00% 10.603us 2 5.3010us 1.2440us 9.3590us cudaEventCreate
0.00% 8.5080us 8 1.0630us 460ns 1.7500us cuDeviceGet
0.00% 8.4590us 1 8.4590us 8.4590us 8.4590us cudaEventElapsedTime
0.00% 7.1350us 1 7.1350us 7.1350us 7.1350us cudaEventSynchronize
0.00% 6.8430us 4 1.7100us 180ns 5.9720us cudaSetupArgument
0.00% 4.7800us 3 1.5930us 437ns 2.8480us cuDeviceGetCount
0.00% 2.3490us 2 1.1740us 361ns 1.9880us cudaConfigureCall
$ nvcc -arch=sm_60 -o t79 t79.cu -DSLEN=512 -DBLKS=32768 -DNTPB=1024
$ ./t79
MFlops per sec: 8.08072e+06 kernel time: 4.25206ms
$
$ nvprof --metrics sm_efficiency_instance,ipc_instance,issued_ipc,issue_slot_utilization,stall_exec_dependency ./t79
==15447== NVPROF is profiling process 15447, command: ./t79
==15447== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "void WGSXMAPIXLLXOPS<int=512>(float*, float)" (done)
Replaying kernel "void WGSXMAPIXLLXOPS<int=512>(float*, float)" (done)
MFlops per sec: 193432 kernel time: 177.632ms
==15447== Profiling application: ./t79
==15447== Profiling result:
==15447== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla P100-PCIE-16GB (0)"
Kernel: void WGSXMAPIXLLXOPS<int=512>(float*, float)
2 issued_ipc Issued IPC 1.972106 1.972388 1.972247
2 issue_slot_utilization Issue Slot Utilization 98.23% 98.24% 98.24%
2 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 16.35% 16.36% 16.36%
2 ipc Executed IPC 1.971976 1.972254 1.972115
2 sm_efficiency Multiprocessor Activity 99.78% 99.78% 99.78%
$
The first run, with numbers that match yours (16384 total threads, 256 threads per block, 8 FFMA instructions per thread), shows a measured kernel duration of ~18us. However, when we run that case in the profiler, we observe that the actual kernel execution is only about 1.6us; the remainder is various kinds of overhead, including kernel launch latency as well as the latency of using the cudaEvent system for timing. This throws the numbers way off.
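If you want the cudaEvent measurement itself to be more meaningful at small sizes, one option (a sketch only, not part of the test case above; NITER is an arbitrary illustrative value) is to time a batch of back-to-back launches and divide, so that the fixed launch and timing overheads are amortized:
// drop-in replacement for the timed region in main() above (illustrative only)
const int NITER = 1000;
cudaEventRecord(start);
for (int i = 0; i < NITER; i++)
WGSXMAPIXLLXOPS<SLEN><<<blks, ntpb>>> (GOut, M);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&et, start, stop);
float avg_ms = et / NITER; // per-kernel time, with launch/timing overhead largely amortized
Note that this fixes the measurement, not the underlying lack of work; the GPU is still not saturated.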
On the other hand, when we launch a large number of blocks, with many threads per block and much more work per thread, we get a number that is about 80% of the peak FP32 capability of the P100.
Most of your metrics are improving (getting better) as you go from kernel 1 to kernel 3. The exception is dram throughput, which makes sense: as the kernel time increases while the same amount of data is written, the average dram throughput goes down. This is consistent with giving the GPU more work, so that it can hide various kinds of latency and amortize the overhead over a larger quantity of work.
Let's take a look at some of these metrics for the final, "large" kernel run above:
2 issued_ipc Issued IPC 1.972106 1.972388 1.972247
2 issue_slot_utilization Issue Slot Utilization 98.23% 98.24% 98.24%
2 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 16.35% 16.36% 16.36%
2 ipc Executed IPC 1.971976 1.972254 1.972115
2 sm_efficiency Multiprocessor Activity 99.78% 99.78% 99.78%
IPC is at around 2 per clock, higher than in your kernel 3. Note that an IPC of 2 is a reasonable upper bound here: an sm_60 SM has 64 single-precision units, enough to schedule two (warp-wide) FFMA instructions per clock.
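As a rough cross-check against the published P100 PCIE specs (56 SMs and a boost clock of roughly 1.3 GHz - figures not taken from the output above):
Peak FP32 ~ 56 SMs x 64 FP32 units x 2 flops/FFMA x ~1.3 GHz ~ 9.3 TFlops/s
so the 8.08 TFlops/s measured above lands in the 80-90% of peak range, consistent with the figure quoted earlier.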
The sm_efficiency and issue_slot_utilization metrics are similar. They indicate that about 98% of the time, the SM was able to issue one or more instructions in any given clock cycle.
The stall_exec_dependency metric answers the question, "across all the actual stall situations, what percentage were due to an execution dependency?" Your kernel has an execution dependency between successive lines of source code, since each line depends on the result of the previous one. At the assembly level this means each FFMA instruction depends on the result of the previous one, so it cannot be issued until the previous one is complete.
If the SM were undersubscribed with available work, this stall reason would go up, because the thing preventing issuance of additional work would be the execution dependency. A value of 16% here means that about 5/6 of the time, when there is a stall, it is not due to an execution dependency. In other words, even though this kernel has plenty of execution dependencies, most of the time when there was a stall it was not because the GPU would have liked to issue the next line of code - it was for some other reason.
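As an aside, the usual way to reduce exec-dependency stalls within a single thread (not something this test case needs once the GPU is saturated) is to carry more than one independent dependency chain per thread, for example two accumulators that can issue in alternation. A minimal sketch of how the kernel body could be restructured:
// illustrative variant of the kernel body: two independent FFMA chains per thread
float a = 1.0f, b = 1.0f;
a = a * MF + a; b = b * MF + b;
a = a * MF + a; b = b * MF + b;
// ... repeat as needed ...
GOut[gid] = a + b; // combine so neither chain is eliminated as dead code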
Summary:
There seem to be at least 2 issues, both related to different kinds of latency:
- At the very small kernel sizes (e.g. 16384 total threads) the kernel execution time is short, so the measurement is clouded by overheads such as kernel launch latency and possibly the timing mechanism itself.
- The kernel sizes, being very small, do not saturate the GPU with as much parallel work as it can handle, so metrics like IPC and sm_efficiency are lower than they could be, and the exec-dependency stall reason is relatively high.
Any time you see an sm_efficiency that low, a possible conclusion is that not enough parallel work has been exposed to the GPU, so neither compute throughput nor memory bandwidth is the limiting factor; instead, latency is what limits performance.
This is consistent with the analysis-driven optimization logic (slide 46 and beyond), and can be rectified by exposing more work to the GPU.