0 votes

I have two almost identical OpenCL kernels whose performance I want to measure in GFLOPS. Kernel #1 is:

__kernel void Test41(__global float *data, __global float *rands, int index, int rand_max){

    float16 temp;
    int gid = get_global_id(0);

    temp = data[gid];
    temp = (float) rands[1] * temp;
    temp = (float) rands[2] * temp;
    temp = (float) rands[3] * temp;
    temp = (float) rands[4] * temp;
    .
    .
    .
    temp = (float) rands[497] * temp;
    temp = (float) rands[498] * temp;
    temp = (float) rands[499] * temp;
    data[gid] = temp.s0;

}

The second kernel is:

__kernel void Test42(__global float *data, __global float *rands, int index, int rand_max){

    float16 temp[500];
    int gid = get_global_id(0);

    temp[0] = data[gid];
    temp[1] = (float) rands[1] * temp[0];
    temp[2] = (float) rands[2] * temp[1];
    temp[3] = (float) rands[3] * temp[2];
    temp[4] = (float) rands[4] * temp[3];
    .
    .
    .
    temp[497] = (float) rands[497] * temp[496];
    temp[498] = (float) rands[498] * temp[497];
    temp[499] = (float) rands[499] * temp[498];
    data[gid] = temp[index].s0;

}

As you can see in the code, I'm using a stream (vector) size of 16. Every kernel has 500 lines of operations, each of which does a single floating-point operation per vector lane. I also deploy around 1048576 work items in total, so I have around 1048576 work items to execute in parallel.

In order to calculate the FLOPS, I do:

FLOPS = numWorkItems (1048576) * opsPerWorkItem (500) * streamSize (16) / timeTaken;
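For reference, timeTaken can be obtained, for example, from OpenCL event profiling. A minimal host-side sketch of that calculation, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and kernel_event is the event returned by clEnqueueNDRangeKernel (report_gflops is just a hypothetical helper name):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Hypothetical helper: turn a profiled kernel event into a GFLOPS figure.
       Assumes the queue was created with CL_QUEUE_PROFILING_ENABLE. */
    static double report_gflops(cl_event kernel_event)
    {
        cl_ulong start = 0, end = 0;
        clWaitForEvents(1, &kernel_event);
        clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);

        double timeTaken = (double)(end - start) * 1e-9;          /* ns -> s */
        double gflops = 1048576.0 * 500.0 * 16.0 / timeTaken * 1e-9;
        printf("%.2f GFLOPS\n", gflops);
        return gflops;
    }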

Unfortunately, for the first kernel I get around 1.4 TFLOPS, but for the second kernel only about 38 GFLOPS. I have not been able to explain this huge gap: using an array of temp values instead of a single temp seems to make an enormous difference. It also seems that real applications are mostly like the second kernel; the first kernel is too simple to be representative.

Can anyone help me understand what exactly is happening here, and how the second kernel's performance can be brought close to the first one's? In general, if I benchmark my device, should I expect to see performance near the theoretical peak?

P.S. I understand that I need to copy rands into __local memory, but let's skip that for now.

A large number of registers is allocated for the array of 500 float16 values in the second kernel, so most probably the kernel occupancy is decreased, causing the kernel to run slowly. – sgarizvi

500 * 16 * 4 = 32 kB for just a single thread. Even a single work item per work group would be slow because of the memory usage. – huseyin tugrul buyukisik
What is a reasonable memory usage per work item? I just want to get a better grasp while designing my benchmark. – saman
Check out the various hardware vendors' OpenCL optimisation guides. Nvidia, AMD, and Intel all have them and give you an idea of how to estimate occupancy for their devices; mobile device vendors probably provide similar documentation. Typically, each CU has a fixed number of registers, and if you divide that by the number of registers used by one thread, you get an upper bound on the number of threads that can run on a CU simultaneously (see the worked example after these comments). – pmdj
Hi. I've made some changes to the code: instead of allocating a temp array of size 500, I allocate 10 and use them in a circular manner across all 500 statements in the kernel. But I still get the same performance numbers. What could be the reason here? – saman
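To put numbers on pmdj's occupancy estimate, here is a back-of-the-envelope calculation with made-up but plausible figures (the real values are device specific and listed in the vendor guides):

    registers per CU (hypothetical)          : 65536 x 32-bit
    live values per work item in kernel #2   : 500 float16 = 8000 x 32-bit
    work items resident per CU (upper bound) : 65536 / 8000 = 8

    In practice the compiler will spill most of those 8000 values to memory
    rather than keep them in registers, which is even worse. Kernel #1 keeps
    only one float16 live, so thousands of work items can stay resident and
    hide memory latency.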

2 Answers

0 votes

As @pmdj suggested in the comments, the main problem with the second kernel is register pressure: you are using a large number of hardware registers, which reduces the number of work groups executing simultaneously. In general, large private arrays are not a good idea in OpenCL/CUDA; there is very little a compiler can do to optimize performance in that case. You could use local memory for the array, but then you need to add the appropriate synchronisation to access it.
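As a rough illustration of the local-memory-plus-barrier pattern mentioned above, here is a sketch applied to the rands array from the question (Test41_local is a hypothetical name, the loop stands in for the 500 unrolled lines, and the work-group size is left to the host):

    __kernel void Test41_local(__global float *data, __global float *rands,
                               int index, int rand_max)
    {
        __local float lrands[500];

        int gid   = get_global_id(0);
        int lid   = get_local_id(0);
        int lsize = get_local_size(0);

        /* Each work item copies a strided slice of rands into local memory. */
        for (int i = lid; i < 500; i += lsize)
            lrands[i] = rands[i];

        /* Everyone must finish writing before anyone starts reading. */
        barrier(CLK_LOCAL_MEM_FENCE);

        float16 temp = (float16)(data[gid]);
        for (int i = 1; i < 500; i++)
            temp = lrands[i] * temp;

        data[gid] = temp.s0;
    }

This keeps the per-work-item register footprint as small as in kernel #1 while reading rands from global memory only once per work group.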

-1 votes

There are two possible issues:

  • You declared the float16 temp buffer as __private (which is the default in OpenCL), and most likely it will be allocated in the global memory space, which has quite high access latency. You might try declaring it as __local float16 if it fits into your device's local memory.
  • Adding the temp buffer created some problems for the compiler. The original code is easily vectorizable on some GPU architectures (Intel, for example), and you added artificial dependencies by introducing a store+load for each step.

I'd actually submit an issue report to the compiler vendor about this. It should be easy enough for the compiler to figure out the dependencies, do the optimizations, and even get rid of your temp buffer.