0 votes

I have two almost identical OpenCL kernels whose performance I want to measure in GFLOPS. Kernel #1 is:

__kernel void Test41(__global float *data, __global float *rands, int index, int rand_max){

    float16 temp;
    int gid = get_global_id(0);

    temp = data[gid];
    temp = (float) rands[1] * temp;
    temp = (float) rands[2] * temp;
    temp = (float) rands[3] * temp;
    temp = (float) rands[4] * temp;
    .
    .
    .
    temp = (float) rands[497] * temp;
    temp = (float) rands[498] * temp;
    temp = (float) rands[499] * temp;
    data[gid] = temp.s0;

}

The second kernel is:

__kernel void Test42(__global float *data, __global float *rands, int index, int rand_max){

    float16 temp[500];
    int gid = get_global_id(0);

    temp[0] = data[gid];
    temp[1] = (float) rands[1] * temp[0];
    temp[2] = (float) rands[2] * temp[1];
    temp[3] = (float) rands[3] * temp[2];
    temp[4] = (float) rands[4] * temp[3];
    .
    .
    .
    temp[497] = (float) rands[497] * temp[496];
    temp[498] = (float) rands[498] * temp[497];
    temp[499] = (float) rands[499] * temp[498];
    data[gid] = temp[index].s0;

}

As you can see in the code, I'm using a stream (vector) size of 16. Every kernel has 500 lines of operations, each of which does a single floating-point operation per vector lane. I also deploy around 1048576 work items in total, so I have around 1048576 work items to execute in parallel.

In order to calculate the FLOPS, I do:

FLOPS = numWorkItems (1048576) * opsPerWorkItem (500) * streamSize (16) / timeTaken;
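For reference, timeTaken can be obtained, for example, from OpenCL event profiling. A minimal host-side sketch of that calculation, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and kernel_event is the event returned by clEnqueueNDRangeKernel (report_gflops is just a hypothetical helper name):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Hypothetical helper: turn a profiled kernel event into a GFLOPS figure.
       Assumes the queue was created with CL_QUEUE_PROFILING_ENABLE. */
    static double report_gflops(cl_event kernel_event)
    {
        cl_ulong start = 0, end = 0;
        clWaitForEvents(1, &kernel_event);
        clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);

        double timeTaken = (double)(end - start) * 1e-9;          /* ns -> s */
        double gflops = 1048576.0 * 500.0 * 16.0 / timeTaken * 1e-9;
        printf("%.2f GFLOPS\n", gflops);
        return gflops;
    }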

Unfortunately, for the first kernel I get around 1.4 TFLOPS, but for the second kernel only about 38 GFLOPS. I have not been able to explain this huge gap: using an array of temp values instead of a single temp seems to make an enormous difference. It also seems that real applications are mostly like the second kernel; the first kernel is too simple to be representative.

Can anyone help me understand what exactly is happening here, and how the second kernel's performance can be brought close to the first one's? In general, if I benchmark my device, should I expect to see performance near the theoretical peak?

P.S. I understand that I need to copy rands into __local memory, but let's skip that for now.

A large number of registers is allocated for the array of 500 float16 values in the second kernel, so most probably the kernel occupancy is decreased, causing the kernel to run slowly. – sgarizvi

500 * 16 * 4 = 32 kB for just a single thread. Even a single work item per work group would be slow because of the memory usage. – huseyin tugrul buyukisik
What is a reasonable memory usage per work item? I just want to get a better grasp while designing my benchmark. – saman
Check out the various hardware vendors' OpenCL optimisation guides. Nvidia, AMD, and Intel all have them and give you an idea of how to estimate occupancy for their devices; mobile device vendors probably provide similar documentation. Typically, each CU has a fixed number of registers, and if you divide that by the number of registers used by one thread, you get an upper bound on the number of threads that can run on a CU simultaneously (see the worked example after these comments). – pmdj
Hi. I've made some changes to the code: instead of allocating a temp array of size 500, I allocate 10 and use them in a circular manner across all 500 statements in the kernel. But I still get the same performance numbers. What could be the reason here? – saman
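To put numbers on pmdj's occupancy estimate, here is a back-of-the-envelope calculation with made-up but plausible figures (the real values are device specific and listed in the vendor guides):

    registers per CU (hypothetical)          : 65536 x 32-bit
    live values per work item in kernel #2   : 500 float16 = 8000 x 32-bit
    work items resident per CU (upper bound) : 65536 / 8000 = 8

    In practice the compiler will spill most of those 8000 values to memory
    rather than keep them in registers, which is even worse. Kernel #1 keeps
    only one float16 live, so thousands of work items can stay resident and
    hide memory latency.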

2 Answers

0 votes

As @pmdj suggested in the comments, the main problem with the second kernel is register pressure: you are using a large number of hardware registers, which reduces the number of work groups executing simultaneously. In general, large private arrays are not a good idea in OpenCL/CUDA; there is very little a compiler can do to optimize performance in that case. You could use local memory for the array, but then you need to add the appropriate synchronisation to access it.
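As a rough illustration of the local-memory-plus-barrier pattern mentioned above, here is a sketch applied to the rands array from the question (Test41_local is a hypothetical name, the loop stands in for the 500 unrolled lines, and the work-group size is left to the host):

    __kernel void Test41_local(__global float *data, __global float *rands,
                               int index, int rand_max)
    {
        __local float lrands[500];

        int gid   = get_global_id(0);
        int lid   = get_local_id(0);
        int lsize = get_local_size(0);

        /* Each work item copies a strided slice of rands into local memory. */
        for (int i = lid; i < 500; i += lsize)
            lrands[i] = rands[i];

        /* Everyone must finish writing before anyone starts reading. */
        barrier(CLK_LOCAL_MEM_FENCE);

        float16 temp = (float16)(data[gid]);
        for (int i = 1; i < 500; i++)
            temp = lrands[i] * temp;

        data[gid] = temp.s0;
    }

This keeps the per-work-item register footprint as small as in kernel #1 while reading rands from global memory only once per work group.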

-1 votes

There are two possible issues:

  • You declared the float16 temp buffer as __private (which is the default in OpenCL), and most likely it will be allocated in the global memory space, which has quite high access latency. You might try declaring it as __local float16 if it fits into your device's local memory.
  • Adding the temp buffer created some problems for the compiler. The original code is easily vectorizable on some GPU architectures (Intel, for example), and you added artificial dependencies by introducing a store+load for each step.

I'd actually submit an issue report to the compiler vendor about this. It should be easy enough for the compiler to figure out the dependencies, do the optimizations, and even get rid of your temp buffer.