I have two nearly identical OpenCL kernels whose performance I want to measure in GFLOPS. Kernel #1 is:
__kernel void Test41(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp;
    int gid = get_global_id(0);

    temp = data[gid];
    temp = (float) rands[1] * temp;
    temp = (float) rands[2] * temp;
    temp = (float) rands[3] * temp;
    temp = (float) rands[4] * temp;
    // ... the same pattern continues ...
    temp = (float) rands[497] * temp;
    temp = (float) rands[498] * temp;
    temp = (float) rands[499] * temp;

    data[gid] = temp.s0;
}
The second kernel is:
__kernel void Test42(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp[500];
    int gid = get_global_id(0);

    temp[0] = data[gid];
    temp[1] = (float) rands[1] * temp[0];
    temp[2] = (float) rands[2] * temp[1];
    temp[3] = (float) rands[3] * temp[2];
    temp[4] = (float) rands[4] * temp[3];
    // ... the same pattern continues ...
    temp[497] = (float) rands[497] * temp[496];
    temp[498] = (float) rands[498] * temp[497];
    temp[499] = (float) rands[499] * temp[498];

    data[gid] = temp[index].s0;
}
As you can see in the code, I'm using a stream (vector) size of 16. Every kernel has 500 lines of operations, each of which performs a single floating-point multiply per vector lane. I launch around 1048576 kernel instances in total, so I have around 1048576 work items executing in parallel.
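For context, the launch is essentially a one-dimensional NDRange over all elements. A minimal host-side sketch (error checking omitted; kernel, queue, and the buffer names are illustrative):

size_t globalSize = 1048576;   /* one work item per element */
cl_event evt;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &dataBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &randsBuf);
clSetKernelArg(kernel, 2, sizeof(cl_int), &index);
clSetKernelArg(kernel, 3, sizeof(cl_int), &rand_max);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &evt);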
In order to calculate the FLOPS I do:

flops = numWorkItems (1048576) * operationsPerWorkItem (500) * streamSize (16) / timeTaken;
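For reference, one way to obtain timeTaken is with OpenCL profiling events, reading back the evt recorded at enqueue above (a sketch; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

clWaitForEvents(1, &evt);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

double timeTaken = (double)(end - start) * 1e-9;      /* profiling ticks are nanoseconds */
double flops = 1048576.0 * 500.0 * 16.0 / timeTaken;  /* multiplies per second */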
Unfortunately, for the first kernel I get around 1.4 TFLOPS, but for the second kernel I get around 38 GFLOPS, and I have not been able to explain this huge gap. Using an array of temp values instead of a single temp variable seems to make an enormous difference. It also seems that real applications look more like the second kernel; the first kernel is too simple to be representative of one.
Can anyone help me understand what exactly is happening here, and how the second kernel's performance could be brought up to the level of the first? More generally, when benchmarking a device, should I expect to see performance close to the theoretical peak?
P.S. I understand I should copy rands into __local memory, but let's skip that for now.