1 vote

The following has been troubling me.

Running the same kernel on two different devices, one with compute capability 1.3 and the other with compute capability 2.0, I get better performance with more threads per block (higher occupancy) on the 1.3, but the opposite on the 2.0. Performance on the 2.0 device peaks at 16 threads per block, an occupancy of 17%. Anything fewer or greater than that gives worse performance.

Since the cause is most likely the nature of the kernel itself, here it is.

__global__ void
kernel_CalculateRFCH (int xstart, int ystart, int xsize, int ysize,
                      int imxsize, int imysize, int *test, int *dev_binIm,
                      int *per_block_results)
{
  int x2, y2, bin, bin2;
  // A shared array's length must be a compile-time constant, so
  // "__shared__ int s_pixels[blockDim.x * blockDim.y];" would not compile.
  // Dynamic shared memory is used instead; its size (blockDim.x *
  // blockDim.y * sizeof(int)) is passed as the third launch parameter.
  extern __shared__ int s_pixels[];

  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int tidy = threadIdx.y + blockIdx.y * blockDim.y;
  int tidx = threadIdx.x + blockIdx.x * blockDim.x;

  if (xstart + xsize > imxsize)
    xsize = imxsize - xstart;
  if (ystart + ysize > imysize)
    ysize = imysize - ystart;

  s_pixels[tx * blockDim.y + ty] = 0;

  if (tidy >= ystart && tidy < ysize + ystart && tidx >= xstart && tidx < xsize + xstart)
    {
      bin = dev_binIm[tidx + tidy * imxsize];

      if (bin >= 0)
        {
          x2 = tidx;
          y2 = tidy;

          while (y2 < ystart + ysize)
            {
              if (x2 >= xstart + xsize || x2 - tidx > 10)
                {
                  x2 = xstart;
                  y2++;
                  if (tidx - x2 > 10)
                    x2 = tidx - 10;
                  if (y2 - tidy > 10)
                    {
                      y2 = ystart + ysize;
                      break;
                    }
                  if (y2 >= ystart + ysize)
                    break;
                }

              bin2 = dev_binIm[x2 + y2 * imxsize];

              if (bin2 >= 0)
                {
                  test[(tidx + tidy * imxsize) * 221 + s_pixels[tx * blockDim.y + ty]] = bin + bin2 * 80;
                  s_pixels[tx * blockDim.y + ty]++;
                }
              x2++;
            }
        }
    }

  // Make every thread's partial count visible before the reduction reads it.
  __syncthreads ();

  // Tree reduction; assumes blockDim.x * blockDim.y is a power of two.
  for (int offset = (blockDim.x * blockDim.y) / 2; offset > 0; offset >>= 1)
    {
      if ((tx * blockDim.y + ty) < offset)
        {
          s_pixels[tx * blockDim.y + ty] += s_pixels[tx * blockDim.y + ty + offset];
        }
      __syncthreads ();
    }

  if (tx * blockDim.y + ty == 0)
    {
      per_block_results[blockIdx.x * gridDim.y + blockIdx.y] = s_pixels[0];
    }
}

I use 2-D threading.

ptxas info : Compiling entry function '_Z20kernel_CalculateRFCHiiiiiiPiS_' for 'sm_10'
ptxas info : Used 16 registers, 128 bytes smem, 8 bytes cmem[1]

16 registers are shown in every case, on every device.

Any ideas on why this could be happening would be very enlightening.

Are you aware of the work of Vasily Volkov? Your question's title is very reminiscent of his "better performance at lower occupancy" presentation. - tera
Threads per block are a factor in determining occupancy, but there is no direct relation such that increasing the number of threads per block increases occupancy (and as you have found, there is no direct relation between occupancy and performance either). Use the Occupancy Calculator to find out the occupancy for your kernel. - tera
Finally, 16 threads per block is far too low, as threads are scheduled in warps of 32 threads. So using just 16 threads uses only half of the available resources (probably even less for other reasons, good block sizes often are between 64..256 threads per block). Are you sure you haven't interchanged the "threads per block" and the "number of blocks" arguments? - tera
I haven't interchanged the arguments. As I see from the occupancy calculator, in my case occupancy increases when I increase the threads per block, and vice versa. On the 1.3 device I get the best performance at 100% occupancy. On the 2.0 device I get peak performance at 17% occupancy (16 threads per block); lesser or greater occupancy than this gives worse performance. Needless to say, when I decrease the number of threads per block, the number of blocks must increase for my kernel. Why does this happen on the 2.0 and not on the 1.3? - user1280671
I agree with tera on both points: 1. better perf at lower occupancy is not an unheard of concept. 2. better perf with 16 threads per block instead of 32 or a multiple of 32 on a cc 2.0 device seems pretty unlikely. - Robert Crovella

1 Answer

1 vote

Apart from the general remarks made above, your kernel is a very special case: most of its threads don't do any work at all. Why don't you add xstart and ystart to tidx and tidy straight away and choose a smaller grid? Your better performance at a smaller block size might just be an artefact of how the region of interest is split into blocks.

This also explains why you see a big difference between compute capability 1.x devices and CC 2.0+ devices. Beginning with CC 2.0, Nvidia GPUs have become much better at handling kernels whose runtime varies greatly between blocks: on compute capability 1.x, a new wave of blocks is only scheduled once all currently running blocks have finished, whereas from CC 2.0 on, a new block is started as soon as any old block has finished.