
I have some questions about how data transfer happens between work items and global memory. Consider the following highly inefficient, memory-bound kernel.

__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);//line no 1

    myreal pCoef = coef[gid];//line no 2
    myreal pRow = row[gid];//line no 3

    pCoef = pCoef - (pRow * ratio);//line no 4
    coef[gid] = pCoef;//line no 5
}
  1. Do all work items in a work group begin executing line no 1 at the same time?
  2. Do all work items in a work group begin executing line no 2 at the same time?
  3. Suppose different work items in a work group finish executing line no 4 at different times. Do the ones that finish early wait, so that all work items transfer their data to global memory at the same time in line no 5?
  4. Do all work items exit the compute unit simultaneously such that early finished work items have to wait until all work items have finished executing?
  5. Suppose each work item has to perform 2 reads from global memory. Is it better to execute these reads one after the other, or is it better to execute some computation statements between the 2 reads?
  6. The kernel shown above is memory-bound on the GPU. Is there any way its performance can be improved?
  7. Are there any general guidelines for avoiding memory-bound kernels?

1 Answer


Find my answers below (thanks to sharpneli for the good comment about AMD GPUs and warps):

  1. Normally YES. But it depends on the hardware. You cannot rely on this behavior and design your algorithm around such "ordered execution"; that is why barriers and mem_fences exist. For example, some GPUs execute only a subset of the work group's work items in lockstep. On a CPU, they may even run completely out of order.
  2. Same as answer 1.
  3. As in answer 1, they will very rarely finish at different times, so YES. However, bear in mind that this is a good feature, since one big write to memory is more efficient than many small writes.
  4. Typically YES (see answer 1 as well).
  5. It is better to interleave the reads with computation, but the compiler will already account for this and reorder the operations to hide the latency of the reads and writes. Of course, the compiler will never move code in a way that can change the result. Unless you manually disable compiler optimizations, this is typical behavior of OpenCL compilers.
  6. NO, it cannot be improved in any way from the kernel's point of view.
  7. The general rule is: is each memory cell of the input used by more than one WI?
    • NO (1 global -> 1 private): this is the case of your kernel in the question. That access is global->private, and there is no way to improve it; do not use local memory, since it would be a waste of time.
    • YES (1 global -> X private): try to move the global memory to local memory first, then have each WI read directly from local into private. Depending on the amount of reuse (maybe only 2 WIs use the same global data), it may not even be worth it if the computation amount is already high. You have to consider the trade-off between the extra local memory usage and the gain in global accesses. For image processing it is typically a good idea; for other types of processing, not so much.
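
To illustrate the global->local->private pattern from point 7's YES case, here is a minimal OpenCL sketch. The kernel name, the tile size, and the 3-point stencil are hypothetical, chosen only to create data reuse between neighbouring work items; it is not taken from the question's kernel.

```c
// Hypothetical 3-point stencil: each global element is read by up to 3 WIs,
// so staging it in local memory once saves redundant global reads.
#define TILE 64  // must match the work-group size chosen by the host (assumption)

__kernel void stencil3(__global const float *in, __global float *out, int n)
{
    __local float tile[TILE + 2];          // work-group tile plus two halo cells

    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // Each WI copies one element global -> local; edge WIs also fetch the halo.
    tile[lid + 1] = (gid < (size_t)n) ? in[gid] : 0.0f;
    if (lid == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (lid == get_local_size(0) - 1)
        tile[lid + 2] = (gid + 1 < (size_t)n) ? in[gid + 1] : 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);          // wait until the whole tile is loaded

    // Now read local -> private: three reads that would otherwise hit global.
    if (gid < (size_t)n)
        out[gid] = tile[lid] + tile[lid + 1] + tile[lid + 2];
}
```

Note the barrier between the load phase and the compute phase: it is exactly the synchronization point mentioned in answer 1, since you cannot assume the other WIs have finished filling the tile.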

NOTE: The same applies if you write to global memory. It is always better for many WIs to combine their results in local memory before writing to global. But if each WI writes to a unique address in global memory, then write directly.
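
As a sketch of the write side of that note, here is a partial-sum kernel where many WIs cooperate in local memory and only one WI per work group touches global memory. The kernel name and the fixed work-group size of 64 are illustrative assumptions, not part of the answer.

```c
__kernel void partialSum(__global const float *in, __global float *partial)
{
    __local float scratch[64];             // assumes work-group size 64 (hypothetical)

    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t grp = get_group_id(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory: many WIs cooperate with no global traffic.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One global write per work group, instead of 64 small ones.
    if (lid == 0)
        partial[grp] = scratch[0];
}
```

By contrast, the kernel in the question already writes one unique address per WI (coef[gid]), which is the "write directly" case from the note.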