I have some questions about how data transfer happens between work items and global memory in OpenCL. Consider the following highly inefficient, memory-bound kernel:
__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);      // line no 1
    myreal pCoef = coef[gid];           // line no 2
    myreal pRow = row[gid];             // line no 3
    pCoef = pCoef - (pRow * ratio);     // line no 4
    coef[gid] = pCoef;                  // line no 5
}
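For reference, the per-element arithmetic each work item performs is equivalent to this host-side C loop, with one loop iteration standing in for one work item (a sketch; the function name and sample values are made up):

```c
#include <stddef.h>

/* Host-side equivalent of the kernel: iteration gid plays the role of
 * the work item with global id gid. */
void reduce_u_ratios(float *coef, const float *row, float ratio, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid) {
        float pCoef = coef[gid];          /* line no 2 */
        float pRow  = row[gid];           /* line no 3 */
        pCoef = pCoef - (pRow * ratio);   /* line no 4 */
        coef[gid] = pCoef;                /* line no 5 */
    }
}
```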
- Do all work items in a work group begin executing line no 1 at the same time?
- Do all work items in a work group begin executing line no 2 at the same time?
- Suppose different work items in a work group finish executing line no 4 at different times. Do the ones that finish early wait, so that all work items transfer their data to global memory at the same time in line no 5?
- Do all work items exit the compute unit simultaneously, i.e. do work items that finish early have to wait until all work items in the work group have finished executing?
- Suppose each kernel has to perform 2 reads from global memory. Is it better to issue these reads back to back, or is it better to place some computation statements between the two reads?
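To make this question concrete, these are the two orderings I am comparing (a sketch; `bufA`, `bufB`, and the intermediate computation are made up, and only the first read feeds it):

```
// Ordering 1: both reads issued back to back
myreal a = bufA[gid];                 // read 1
myreal b = bufB[gid];                 // read 2
myreal t = a * a + (myreal)1;         // computation needing only a

// Ordering 2: independent computation placed between the reads
myreal a2 = bufA[gid];                // read 1
myreal t2 = a2 * a2 + (myreal)1;      // computation needing only a2
myreal b2 = bufB[gid];                // read 2
```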
- The kernel shown above is memory bound on a GPU. Is there any way to improve its performance?
- Are there any general guidelines for avoiding memory-bound kernels?
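For what it's worth, one variant I have been considering is letting each work item move four consecutive elements via vector types, so more bytes are in flight per work item (a sketch, assuming `myreal` is `float` and the buffer length is a multiple of 4; `reduceURatios4` is a made-up name, and the host would launch n/4 work items):

```
__kernel void reduceURatios4(__global float4 *coef, __global float4 *row, float ratio)
{
    size_t gid = get_global_id(0);
    // One float4 load per buffer reads four elements at once;
    // the subtraction and multiply apply component-wise.
    coef[gid] = coef[gid] - (row[gid] * ratio);
}
```

I do not know whether this actually helps on my hardware, which is partly why I am asking about general guidelines.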