Solving tridiagonal linear systems in CUDA

Question

I am trying to implement a tridiagonal system solver based on the Cyclic Reduction method on my GTS450.

Cyclic Reduction is illustrated in this paper

Y. Zhang, J. Cohen, J.D. Owens, "Fast Tridiagonal Solvers on GPU"

However, whatever I do, my CUDA code is far slower than the sequential counterpart. My result for a total of 512 x 512 points is 7ms, however on my i7 3.4GHz it is 5ms. The GPU is not accelerating!

Which could be the problem?

#include "cutrid.cuh"
 __global__ void cutrid_RC_1b(double *a,double *b,double *c,double *d,double *x)
{
 int idx_global=blockIdx.x*blockDim.x+threadIdx.x;
 int idx=threadIdx.x;

 __shared__ double asub[512];
 __shared__ double bsub[512];
 __shared__ double csub[512];
 __shared__ double dsub[512];

 double at=0;
 double bt=0;
 double ct=0;
 double dt=0;

 asub[idx]=a[idx_global];
 bsub[idx]=b[idx_global];
 csub[idx]=c[idx_global];
 dsub[idx]=d[idx_global];


 for(int stride=1;stride<N;stride*=2)
  {
    int margin_left,margin_right;
    margin_left=idx-stride;
    margin_right=idx+stride;


    at=(margin_left>=0)?(-csub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f; 

    bt=bsub[idx]+((margin_left>=0)?(-csub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f)
    -((margin_right<512)?asub[idx+stride]*csub[idx]/bsub[idx+stride]:0.f); 

    ct=(margin_right<512)?(-csub[idx+stride]*asub[idx]/bsub[idx+stride]):0.f; 

    dt=dsub[idx]+((margin_left>=0)?(-dsub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f)
    -((margin_right<512)?dsub[idx+stride]*csub[idx]/bsub[idx+stride]:0.f); 

    __syncthreads();
    asub[idx]=at;
    bsub[idx]=bt;
    csub[idx]=ct;
    dsub[idx]=dt;
    __syncthreads();
  }


x[idx_global]=dsub[idx]/bsub[idx];

}/*}}}*/

I launched this kernel by cutrid_RC_1b<<<512,512>>>(d_a,d_b,d_c,d_d,d_x), and reached 100% device occupancy. This result has puzzled me for days.

There is an improved version of my code:

    #include "cutrid.cuh"
    __global__ void cutrid_RC_1b(float *a,float *b,float *c,float *d,float *x)
    {/*{{{*/
     int idx_global=blockIdx.x*blockDim.x+threadIdx.x;
     int idx=threadIdx.x;

     __shared__ float asub[512];
     __shared__ float bsub[512];
     __shared__ float csub[512];
     __shared__ float dsub[512];

    asub[idx]=a[idx_global];
    bsub[idx]=b[idx_global];
    csub[idx]=c[idx_global];
    dsub[idx]=d[idx_global];
 __syncthreads();
   //Reduction  
    for(int stride=1;stride<512;stride*=2)
    {
        int margin_left=(idx-stride);
        int margin_right=(idx+stride);
        if(margin_left<0) margin_left=0;
        if(margin_right>=512) margin_right=511;
        float tmp1 = asub[idx] / bsub[margin_left];
        float tmp2 = csub[idx] / bsub[margin_right];
        float tmp3 = dsub[margin_right];
        float tmp4 = dsub[margin_left];
        __syncthreads();

        dsub[idx] = dsub[idx] - tmp4*tmp1-tmp3*tmp2;
        bsub[idx] = bsub[idx]-csub[margin_left]*tmp1-asub[margin_right]*tmp2;

        tmp3 = -csub[margin_right]; 
        tmp4 = -asub[margin_left];

        __syncthreads();
        asub[idx] = tmp3*tmp1;
        csub[idx] = tmp4*tmp2;
        __syncthreads();
     }

        x[idx_global]=dsub[idx]/bsub[idx];

    }/*}}}*/

The speed is improved to 0.73ms on a Quadro k4000 for 512 x 512 system, however the code in the mentioned paper runs in 0.5ms on a GTX280.

You didn't write it correctly. Can you tell anything about chunks of work, GPU loads, etc.? — duffymo
Your code is full of conditional statements, which perhaps is one of the main limiting factors. — Vitality
@duffymo i launched my kernel by cutrid_RC_1b<<<512,512>>>(d_a,d_b,d_c,d_d,d_x). In this way i reached 100% occupancy on gts480. — pengjun
@JackOLantern the conditional statements are inevitable for this algorithm. there may be ways to improve it, but i haven't figured out. — pengjun
You are comparing a low-end GPU and a fairly high-end CPU. You may want to compare the raw double-precision GFLOPS of both devices to get an idea whether acceleration is likely assuming well-optimized code is run on both CPU and GPU. — njuffa

Vitality Vitality · Accepted Answer · 2013-10-23T21:17:59

Solving a tridiagonal system of equations is a challenging parallel problem since the classical solution scheme, i.e., Gaussian elimination, is inherently sequential.

Cyclic Reduction consists of two phases:

Forward Reduction. The original system is split in two independent tridiagonal systems for two sets of unknowns, the ones with odd index and the ones with even index. Such systems can be solved independently and this step can be seen as the ﬁrst of a divide et impera scheme. The two smaller systems are split again in the same way in two subsystems and the process is repeated until a system of only 2 equations is reached.
Backward Substitution. The system of 2 equations is solved first. Then, the divide et impera structure is climbed up by solving the sub-systems independently on diﬀerent cores.

I'm not sure (but correct me if I'm wrong) that your code will return consistent results. N does not appear to be defined. Also, you are accessing csub[idx-stride], but I'm not sure what does it mean when idx==0 and stride>1. Furthermore, you are using several conditional statements, essentially for boundary checkings. Finally, your code lacks a proper thread structure capable to deal with the mentioned divide et impera scheme, conceptually pretty much like the one used in the CUDA SDK reduction samples.

As mentioned in one of my comments above, I remembered that at tridiagonalsolvers you can find an implementation of the Cyclic Reduction scheme for solving tridiagonal equation systems. Browsing the related google pages, it seems to me that the code is mantained, among others, by the first Author of the above paper (Yao Zhang). The code is copied and pasted below. Note that the boundary check is done only once (if (iRight >= systemSize) iRight = systemSize - 1;), thus limiting the number of conditional statements involved. Note also the thread structure capable to deal with a divide et impera scheme.

The code by Zhang, Cohen and Owens

__global__ void crKernel(T *d_a, T *d_b, T *d_c, T *d_d, T *d_x)
{
   int thid = threadIdx.x;
   int blid = blockIdx.x;

   int stride = 1;

   int numThreads = blockDim.x;
   const unsigned int systemSize = blockDim.x * 2;

   int iteration = (int)log2(T(systemSize/2));
   #ifdef GPU_PRINTF 
    if (thid == 0 && blid == 0) printf("iteration = %d\n", iteration);
   #endif

   __syncthreads();

   extern __shared__ char shared[];

   T* a = (T*)shared;
   T* b = (T*)&a[systemSize];
   T* c = (T*)&b[systemSize];
   T* d = (T*)&c[systemSize];
   T* x = (T*)&d[systemSize];

   a[thid] = d_a[thid + blid * systemSize];
   a[thid + blockDim.x] = d_a[thid + blockDim.x + blid * systemSize];

   b[thid] = d_b[thid + blid * systemSize];
   b[thid + blockDim.x] = d_b[thid + blockDim.x + blid * systemSize];

   c[thid] = d_c[thid + blid * systemSize];
   c[thid + blockDim.x] = d_c[thid + blockDim.x + blid * systemSize];

   d[thid] = d_d[thid + blid * systemSize];
   d[thid + blockDim.x] = d_d[thid + blockDim.x + blid * systemSize];

   __syncthreads();

   //forward elimination
   for (int j = 0; j <iteration; j++)
   {
       __syncthreads();
       stride *= 2;
       int delta = stride/2;

    if (threadIdx.x < numThreads)
    {
        int i = stride * threadIdx.x + stride - 1;
        int iLeft = i - delta;
        int iRight = i + delta;
        if (iRight >= systemSize) iRight = systemSize - 1;
        T tmp1 = a[i] / b[iLeft];
        T tmp2 = c[i] / b[iRight];
        b[i] = b[i] - c[iLeft] * tmp1 - a[iRight] * tmp2;
        d[i] = d[i] - d[iLeft] * tmp1 - d[iRight] * tmp2;
        a[i] = -a[iLeft] * tmp1;
        c[i] = -c[iRight] * tmp2;
    }
       numThreads /= 2;
   }

   if (thid < 2)
   {
     int addr1 = stride - 1;
     int addr2 = 2 * stride - 1;
     T tmp3 = b[addr2]*b[addr1]-c[addr1]*a[addr2];
     x[addr1] = (b[addr2]*d[addr1]-c[addr1]*d[addr2])/tmp3;
     x[addr2] = (d[addr2]*b[addr1]-d[addr1]*a[addr2])/tmp3;
   }

   // backward substitution
   numThreads = 2;
   for (int j = 0; j <iteration; j++)
   {
       int delta = stride/2;
       __syncthreads();
       if (thid < numThreads)
       {
           int i = stride * thid + stride/2 - 1;
           if(i == delta - 1)
                 x[i] = (d[i] - c[i]*x[i+delta])/b[i];
           else
                 x[i] = (d[i] - a[i]*x[i-delta] - c[i]*x[i+delta])/b[i];
        }
        stride /= 2;
        numThreads *= 2;
     }

   __syncthreads();

   d_x[thid + blid * systemSize] = x[thid];
   d_x[thid + blockDim.x + blid * systemSize] = x[thid + blockDim.x];

}

Solving tridiagonal linear systems in CUDA

3 Answers