I have written a simple tiled matrix multiplication kernel in CUDA:
__global__ void matrixMultiplyShared(float * A, float * B, float * C,
                         int numARows, int numAColumns,
                         int numBRows, int numBColumns,
                         int numCRows, int numCColumns) {
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;
    float Cvalue = 0.0f;
    // Loop over the tiles of A and B required to compute this element of C
    for (int m = 0; m < (numAColumns-1)/TILE_WIDTH+1; ++m) {
        if(row<numARows && m*TILE_WIDTH+tx < numAColumns){
            ds_A[ty][tx] = A[row*numAColumns + m*TILE_WIDTH+tx];
        } else {
            ds_A[ty][tx] = 0;
        }
        if(m*TILE_WIDTH+ty < numBRows && col < numBColumns){
            ds_B[ty][tx] = B[(m*TILE_WIDTH+ty)*numBColumns+col];
        } else {
            ds_B[ty][tx] = 0;
        }
        __syncthreads();
        if(row < numCRows && col < numCColumns){
            for (int k = 0; k < TILE_WIDTH; ++k)
                Cvalue += ds_A[ty][k] * ds_B[k][tx];
        }
        __syncthreads();
    }
    if(row < numCRows && col < numCColumns)
        C[row*numCColumns+col] = Cvalue;
}
After that, I used the same kernel (with some minor changes) in an OpenCL version to compare the performance of CUDA and OpenCL. The result was far beyond my expectations: OpenCL was 6-7 times faster than CUDA. Is this valid? The Nsight output is as follows:
CUDA: (Nsight screenshot)

OpenCL: (Nsight screenshot)

You can see a large gap between starting the app and executing the kernel. Why does that happen?
My GPU is a GTX 580. Kernel execution time (CUDA): 3.78 s. Kernel execution time (OpenCL): 0.53 s.
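Before reading much into a 6-7x gap, it may be worth confirming that both versions compute the same product. Below is a minimal sketch of a CPU reference, assuming row-major storage as in the kernel above. The names `matMulReference` and `maxAbsDiff` are hypothetical helpers written for this illustration and are not part of the linked code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: plain row-major C = A * B on the CPU.
// A is numARows x numAColumns, B is numAColumns x numBColumns.
std::vector<float> matMulReference(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int numARows, int numAColumns,
                                   int numBColumns) {
    assert(A.size() == static_cast<std::size_t>(numARows) * numAColumns);
    assert(B.size() == static_cast<std::size_t>(numAColumns) * numBColumns);
    std::vector<float> C(static_cast<std::size_t>(numARows) * numBColumns, 0.0f);
    for (int row = 0; row < numARows; ++row) {
        for (int k = 0; k < numAColumns; ++k) {
            const float a = A[row * numAColumns + k];
            for (int col = 0; col < numBColumns; ++col) {
                C[row * numBColumns + col] += a * B[k * numBColumns + col];
            }
        }
    }
    return C;
}

// Hypothetical helper: largest absolute element-wise difference.
// Returns infinity when the two buffers differ in size.
float maxAbsDiff(const std::vector<float>& x, const std::vector<float>& y) {
    if (x.size() != y.size()) {
        return std::numeric_limits<float>::infinity();
    }
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        worst = std::fmax(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}
```

The idea would be to copy C back from the device in both the CUDA and the OpenCL program and pass each result to `maxAbsDiff` together with the reference. A kernel that silently failed to launch typically leaves C unchanged and can report a misleadingly short time. GPU and CPU sums can differ slightly because of floating-point ordering, so a small tolerance is probably more appropriate than an exact comparison.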
CUDA Code: http://pastebin.com/VQMp3Hba
OpenCL Host Code: http://pastebin.com/cjGYSLQf
OpenCL Kernel Code: http://pastebin.com/KKw3Ayz7
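Regarding the gap before the kernel and the two timings above: one thing that can skew such a comparison is one-time startup cost (for example, context creation on the first API call) being counted in one measurement but not the other. Here is a minimal sketch of a host-side timer that excludes warm-up runs. The name `averageMsAfterWarmup` and its parameters are hypothetical, and it assumes the callable blocks until the device has finished, for instance by calling `cudaDeviceSynchronize()` after the launch in CUDA or `clFinish()` on the command queue in OpenCL, since kernel launches are asynchronous.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `warmup` times without timing it,
// then returns the average wall-clock milliseconds over `iters` timed runs.
// `work` is assumed to block until the GPU has finished the kernel.
template <typename Work>
double averageMsAfterWarmup(Work work, int warmup, int iters) {
    assert(iters > 0);
    for (int i = 0; i < warmup; ++i) {
        work();  // absorbs one-time costs such as lazy initialization
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iters;
}
```

Using the same helper around both the CUDA launch and the OpenCL enqueue would at least put the two numbers on the same footing. It does not by itself explain the difference.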