I have a CUDA kernel where there are many operations and few branches. It looks like
__global__
void kernel(Real *randomValues, Real mu, Real sigma)
{
int row = blockDim.y * blockIdx.y + threadIdx.y;
int col = blockDim.x * blockIdx.x + threadIdx.x;
if ( row >= cnTimeSteps || col >= cnPaths ) return;
Real alphaLevel = randomValues[row*cnPaths+col];
Real q = 0.0;
Real x = 0.0;
if ( alphaLevel < p_low)
{
q = sqrt( -2*log( alphaLevel ) );
x = (((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
}
else if ( alphaLevel < p_high )
{
q = alphaLevel-0.5;
Real r = q*q;
x= (((((a1*r+a2)*r+a3)*r+a4)*r+a5)*r+a6)*q / (((((b1*r+b2)*r+b3)*r+b4)*r+b5)*r+1);
}
else
{
q = sqrt( -2*log( 1.0-alphaLevel ) );
x = -(((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
}
randomValues[row*cnPaths+col] = sigma * x + mu;
}
where all the a
's, b
's, c
's and d
's are constant values (in the device constant memory)
static __device__ __constant__ Real a1 = 1.73687;
static __device__ __constant__ Real a2 = 1.12321100;
and so on.
After profiling the kernel I found that the theoretical occupancy is 100% but I am getting no more than 60%.
I went through this and this GTC talks to try to optimize my kernel.
On one side I have that the IPC reports an average of 1.32 issued instructions and 0.62 executed. The instruction serialization is about 50% but the SM activity is almost 100%. On the other hand, there are around 38 active warps but 8 are eligible to execute the next instruction but on warp issue efficiency I get that around 70% of the cycles there is no eligible warp. The stall reasons are reported as "Other" which I think has to do with the computation of the log
and sqrt
.
- How can the SM activity be 99.82% if most of the cycles there is no eligible warp?
- How can I reduce stall?
- As threads in a warp may not go into the same branch, requests to constant memory are probably seralized, is this true? Should I put those constants in global memory (maybe use shared memory also)?
Is the first time I use Nsight Visual Studio so I'm trying to figure out the meaning of all the performance analysis. BTW my card is a Quadro K4000.
__constant__
data. (2) The code seems to compute rational approximations to some mathematical function, and it looks like that function may be closely related to the error function or the CDF of the normal distribution. If so, consider using one of CUDA's erf(), erfc(), erfinv(), erfcinv(), normcdf(), normcdfinv() functions, as appropriate. – njuffa