Pointless use of CUDA shared memory

Question

I have two versions of the same algorithm. It was originally the convolution but I modified it to reduce it to this to check where is my bottle neck (note that there is a single access to global memory per loop):

__global__
void convolve (unsigned char * Md, float * Kd, unsigned char * Rd, int width, int height, int kernel_size, int tile_width, int channels){

int row = blockIdx.y*tile_width + threadIdx.y;
int col = blockIdx.x*tile_width + threadIdx.x;

int sum = 0;
int pixel;
int local_pixel;
int working_pixel;

int row_offset = (kernel_size/2)*(width+kernel_size-1);
int col_offset = kernel_size/2;

for(int color=0; color<channels; color++){

    pixel = color*width*height + row*width + col;
    local_pixel = color*(width+kernel_size-1)*(height+kernel_size-1) + row*(width+kernel_size-1) + col + row_offset + col_offset;
    if(row < height  &&  col < width){
        for(int x=(-1)*kernel_size/2; x<=kernel_size/2; x++)
            for(int y=(-1)*kernel_size/2; y<=kernel_size/2; y++){
                working_pixel = local_pixel + x + y*(width+kernel_size-1);
                sum += (int)((float)Md[working_pixel]);// * ((float)Kd[x+kernel_size/2 + (y+kernel_size/2)*kernel_size]);

            }
        Rd[pixel] = (int) sum;
        sum = 0;
    }
}
}

and this is the shared memory version (one single access to shared memory per loop)

__global__
void convolve (unsigned char * Md, float * Kd, unsigned char * Rd, int width, int height, int kernel_size, int tile_width, int channels){

__shared__ unsigned char Mds[256 + 16*4 +4];

int row = blockIdx.y*tile_width + threadIdx.y;
int col = blockIdx.x*tile_width + threadIdx.x;

if(row < height  &&  col < width){

    int sum = 0;
    int pixel;  //the pixel to copy from Md (the input image)
    int local_pixel;    //the pixel in shared memory

    int start_pixel;    //the offset to copy the borders

    int mds_width = tile_width+kernel_size-1;
    int md_width = width+kernel_size-1;
    int md_height = height+kernel_size-1;

    for(int color=0; color<channels; color++){

        pixel = color*md_width*md_height + row*md_width + col  +  (kernel_size/2)*md_width + kernel_size/2; //position (including borders) + offset
        local_pixel = threadIdx.y*mds_width + threadIdx.x  +  (kernel_size/2)*mds_width + kernel_size/2;    //position + offset


        //Loading the pixels
        Mds[local_pixel] = Md[pixel];//bringing the central pixel itself (position + offset)


        __syncthreads();

        //Convolving
        for(int x=(-1)*kernel_size/2; x<=kernel_size/2; x++)
            for(int y=(-1)*kernel_size/2; y<=kernel_size/2; y++)
                sum += (int)((float)Mds[local_pixel + x + y*mds_width]); // * ((float)Kd[x+kernel_size/2 + (y+kernel_size/2)*kernel_size]);
        Rd[color*width*height + row*width + col] = (int) sum;
        sum = 0;

        __syncthreads();

    }
}
}

the executions parameters are

convolve<<<dimGrid,dimBlock>>>(Md,Kd,Rd,width,new_height,kernel_size,block_size,colors);

dimGrid = (1376,768)
dimBlock = (16,16)
Md is the read only image
Kd is the filter (3x3)
width = 22016
height = 12288
kernel_size = 3
block_size=16
colors=3

I obtain 1249.59 ms with the first algorithm and 1178.2 ms with the second one, which I find ridiculous. I think that the number of registers should not be a problem. Compiling with ptxas I get:

ptxas info: 560 bytes gmem, 52 bytes cmem[14]
ptxas info: Compiling entry function '_Z8convolvePhPfS_iiiii' for 'sm_10'
ptxas info: Used 16 registers, 384 bytes smem, 4 bytes cmem[1]

while the info of my device is:

Name: GeForce GTX 660 Ti
Minor Compute Capability: 0
Major Compute Capability: 3
Warp Size: 32
Max Treads per Block: 1024
Max Threads Dimension: (1024,1024,64)
Max Grid Size: (2147483647,65535,65535)
Number of SM: 7
Max Threads Per SM: 2048
Regs per Block (SM): 65536
Total global Memory: 2146762752
Shared Memory per Block: 49152

Does anyone remotely have any hint about this poor improvement? I don't know anybody else to ask..

EDIT: I'm using today a different nvidia card since I cannot access the lab. It also has compute capability 3.0. I put both if statements out of the loop. I'm compiling with -arch compute_30 -code sm_30 I remove all the castings. The global matrix is now declared as const unsigned char * restrict Md I used this time a 9x9 filter which makes each pixel be reused 81 times after be brought in shared memory.

I get 3138.41 ms (global version) and 3120.96 ms (shared version) from the terminal. In the visual profiler it takes longer. This is what I get (screenshots) http://cl.ly/image/1X372l242S2u

as lost as I was..

Please find here this algorithm easy to compile and execute:

http://cl.ly/213l2X3S1v3a

./convolution 8000 4000 159 9 edge_detection_9.txt 0 for the global memory version ./convolution 8000 4000 159 9 edge_detection_9.txt 1 for the shared memory version

My suggestion: Provide a short, complete code. That is a code that I can copy, paste, compile, and run, without adding anything or changing anything, to see the comparison in execution time. — Robert Crovella
This is what the CUDA profiling tools are for. Have a look at nvprof or Nvidia Visual Profiler. Both of these will allow you to identify the performance bottleneck on your kernel. It would not surprise me that this kernel is arithmetic bound. You are performing many operations multiple times in the kernel that could be cached (there is no guarantee that the compiler will optimize this for you). — ebarr
As Robert says, we really need a complete reproducing example. I can't be sure from the above, for example, how you are timing the kernel. It may be there is a flaw here? Without being able to reproduce your results it is much harder to help you. — Jez
Please find the file I attached on the last link of the edited post. @Jez — Mr. K

VAndrei VAndrei · Accepted Answer · 2014-11-15T22:26:16

First thing that draw my attention:

ptxas info: Compiling entry function '_Z8convolvePhPfS_iiiii' for 'sm_10'

Your card has compute capability 3.0 so you should compile with sm_30. sm_10 lacked many features of sm_30, had smaller shared memory and less registers.

The next thing i would do is to put the if statement in both kernels outside of the for loop, for proper kernel comparison.

Next, I would increase the kernel_size to highlight the impact of shared memory. You have only 9 accesses (if I counted correctly) in your kernels and this means the following:

In the first kernel, you read the elements from global memory directly into registers and use them
In the second kernel each thread reads one element from global memory and each thread makes 9 accesses in the shared cache.
Since you don't make heavy reuse of the elements in the shared cache, the price you pay for accessing the global memory is too big.

In addition sum += (int)((float)Mds[local_pixel + x + y*mds_width]); generates some bank conflicts into the shared cache decreasing its throughput.

If your kernel_size is always 3, you could also replace the for loops by unrolling them and using fixed indexes, to help the compiler.

I'm also concerned about the penalty of the cast from uchar to float to int. I know that these operations are costly, further bringing down the shared cache usage gain. Why do you cast for example (int) sum; since Rd is unsigned char? Why don't declare Rd as an int* ?

I see that Kd is also needed in your kernel. As it is declared now it's stored in the global memory. If it's only a 3x3 filter, you could hardcode it or load it outside of the loop in a thread local variable, that has some chances of being stored into registers.

If that does not work, you could try storing the coefficients into shared memory. Actually replicate the coeffs. there for each thread so that you avoid bank conflicts. The shared memory has 32 ports on Kepler so that will allow extraction of the coefficients symultaneously on all threads in the warp.

In conclusion I think that your shared cache kernel pays the price of acessing global memory, shared memory bank conflicts, using sm_10 and multiple type casts so that the gain of shared cache is greatly diminished. One general recommendation, use the CUDA Visual Profiler to validate these points.

Also, I would try using the texture cache by declaring Md as const __restrict__ . That could show some speedup compared to global memory access, because that's a multi-ported cache with a special mapping designed to reduce bank conflicts. Actually I expect this to work better even than the shared memory case.

Pointless use of CUDA shared memory

1 Answers