Low performance kernel

Question

I have a CUDA kernel where there are many operations and few branches. It looks like

__global__
void kernel(Real *randomValues, Real mu, Real sigma)
{
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;

    if ( row >= cnTimeSteps || col >= cnPaths ) return;

    Real alphaLevel = randomValues[row*cnPaths+col];
    Real q = 0.0;
    Real x = 0.0;

    if ( alphaLevel < p_low)
    {
        q = sqrt( -2*log( alphaLevel ) );
        x = (((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
    }
    else if ( alphaLevel < p_high )
    {
        q = alphaLevel-0.5;
        Real r = q*q;
        x= (((((a1*r+a2)*r+a3)*r+a4)*r+a5)*r+a6)*q / (((((b1*r+b2)*r+b3)*r+b4)*r+b5)*r+1);
    }
    else
    {
        q = sqrt( -2*log( 1.0-alphaLevel ) );
        x = -(((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
    }

    randomValues[row*cnPaths+col] = sigma * x + mu;
}

where all the a's, b's, c's and d's are constant values (in the device constant memory)

static __device__ __constant__ Real a1 = 1.73687;
static __device__ __constant__ Real a2 = 1.12321100;

and so on.

After profiling the kernel I found that the theoretical occupancy is 100% but I am getting no more than 60%.

I went through this and this GTC talks to try to optimize my kernel.

On one side I have that the IPC reports an average of 1.32 issued instructions and 0.62 executed. The instruction serialization is about 50% but the SM activity is almost 100%. On the other hand, there are around 38 active warps but 8 are eligible to execute the next instruction but on warp issue efficiency I get that around 70% of the cycles there is no eligible warp. The stall reasons are reported as "Other" which I think has to do with the computation of the log and sqrt.

How can the SM activity be 99.82% if most of the cycles there is no eligible warp?
How can I reduce stall?
As threads in a warp may not go into the same branch, requests to constant memory are probably seralized, is this true? Should I put those constants in global memory (maybe use shared memory also)?

Is the first time I use Nsight Visual Studio so I'm trying to figure out the meaning of all the performance analysis. BTW my card is a Quadro K4000.

regarding your question 3, I don't see anything wrong with your usage of constant memory. This is a sensible application of constant memory. The warp divergence is an unrelated issue and does not, by itself, result in any "serialization" of accesses to constant memory. All of the threads in a warp on a particular path are executing in lockstep, and those threads will all be serviced simultaneously by a given constant memory request, at least in the code you have shown here. — Robert Crovella
(1) From a performance perspective, it would probably be better to use literal constants instead of __constant__ data. (2) The code seems to compute rational approximations to some mathematical function, and it looks like that function may be closely related to the error function or the CDF of the normal distribution. If so, consider using one of CUDA's erf(), erfc(), erfinv(), erfcinv(), normcdf(), normcdfinv() functions, as appropriate. — njuffa
@BRabbit27: A closer study of the approximations above strongly suggests that they represent the single-precision approximation of the inverse of the cumulative distribution function of the normal distribution. CUDA has a built-in function for that, normcdfinvf(). I would suggest giving that a try to see whether its use can help to improve the performance of this code. — njuffa

Greg Smith Greg Smith · Accepted Answer · 2013-08-28T19:15:44

1) How can the SM activity be 99.82% if most of the cycles there is no eligible warp?

A warp is active if registers and a warp slot are allocated to the warp. A SM is active if at least 1 warp is active on the SM.

SM activity should not be confused with efficiency.

2) How can I reduce stall?

In the case of code above the warps are stalled waiting for the the double precision execution units to be available. The Quadro K4000 has a throughput of 8 threads/cycle for double precision operations.

The remedies for this problem are: a. Decrease the number of double precision operations. For example, moving consecutive operations to float may significantly improve performance as single precision floating point throughput is 24x double precision throughput. b. Execute the kernel on a GK110 which has 8X the double precision throughput of a GK10x.

Increasing the achieved occupancy may not increase the performance of this kernel on the K4000. You have provided insufficient information to determine why achieved occupancy is significantly less than theoretical occupancy.

The Achieved FLOPs experiment can be used to confirm if the kernel performance is bound by double precision throughput.

3) As threads in a warp may not go into the same branch, requests to constant memory are probably seralized, is this true? Should I put those constants in global memory (maybe use shared memory also)?

The code has no memory address divergence in the constant memory loads. Warp control flow divergence just means that on each request on a portion of the threads will be active.

The initial global load may not be coalesced. You need to provide the value of cnPaths for someone to review. You could also look at the Memory experiments or the Source Correlated experiments.

The if and else statement may be able to be coded in a more efficient manner to allow the compiler to use predication instead of divergence branches.

Low performance kernel

3 Answers