2 votes

I have coded a simple tiled matrix multiplication in CUDA. It's like this:

__global__ void matrixMultiplyShared(float * A, float * B, float * C,
                         int numARows, int numAColumns,
                         int numBRows, int numBColumns,
                         int numCRows, int numCColumns) {

    // Tiles of A and B staged in on-chip shared memory
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Row and column of the C element this thread computes
    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;

    float Cvalue = 0.0f;   // accumulator for C[row][col]

    // Loop over the A and B tiles required to compute this C element
    for (int m = 0; m < (numAColumns-1)/TILE_WIDTH+1; ++m) {
        // Load one element of the A tile, zero-padding reads past the matrix edge
        if(row < numARows && m*TILE_WIDTH+tx < numAColumns){
            ds_A[ty][tx] = A[row*numAColumns + m*TILE_WIDTH+tx];
        } else {
            ds_A[ty][tx] = 0;
        }
        // Load one element of the B tile, zero-padding reads past the matrix edge
        if(m*TILE_WIDTH+ty < numBRows && col < numBColumns){
            ds_B[ty][tx] = B[(m*TILE_WIDTH+ty)*numBColumns+col];
        } else {
            ds_B[ty][tx] = 0;
        }
        __syncthreads();   // make sure both tiles are fully loaded before computing
        if(row < numCRows && col < numCColumns){
            for (int k = 0; k < TILE_WIDTH; ++k)
                Cvalue += ds_A[ty][k] * ds_B[k][tx];
        }
        __syncthreads();   // make sure all threads are done before the tiles are overwritten
    }
    if(row < numCRows && col < numCColumns)
        C[row*numCColumns+col] = Cvalue;
}
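
For reference, a minimal sketch of how the kernel is launched on the host (assuming d_A, d_B and d_C are device pointers already allocated with cudaMalloc and TILE_WIDTH is a compile-time constant such as 16; the full host code is in the pastebin links below):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
// Grid covers the output matrix C, rounded up to whole tiles
dim3 dimGrid((numCColumns - 1) / TILE_WIDTH + 1,
             (numCRows - 1) / TILE_WIDTH + 1, 1);

matrixMultiplyShared<<<dimGrid, dimBlock>>>(d_A, d_B, d_C,
                                            numARows, numAColumns,
                                            numBRows, numBColumns,
                                            numCRows, numCColumns);
cudaDeviceSynchronize();   // wait for the kernel to finish before copying C back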

After that, I used the same kernel above (with some minor changes) in an OpenCL version to compare the performance of CUDA and OpenCL. But the result was far beyond my expectations: OpenCL was 6-7 times faster than CUDA. Is that valid? The Nsight output is as follows:

CUDA (Nsight output): kernel execution time: 3.78 s

OpenCL (Nsight output): kernel execution time: 0.53 s

You can also see a large gap between starting the app and executing the kernel. Why does that happen?


GPU: GTX 580 | Kernel execution time (CUDA): 3.78 s | Kernel execution time (OpenCL): 0.53 s

CUDA Code: http://pastebin.com/VQMp3Hba

OpenCL Host Code: http://pastebin.com/cjGYSLQf

OpenCL Kernel Code: http://pastebin.com/KKw3Ayz7

I think you should add the code for the OpenCL implementation too; then there would at least be a chance someone spots the cause of the difference... - ppeterka
I have added the codes. OpenCL: pastebin.com/cjGYSLQf CUDA: pastebin.com/VQMp3Hba - Arya Mz
Your OpenCL code doesn't include the kernel source... - talonmies
OpenCL Kernel Code: pastebin.com/KKw3Ayz7 - Arya Mz

1 Answer

1 vote

You can try inserting explicit timers in the code instead of trusting the output from the tool. It may be that the tool is wrong.
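
For example, here is a minimal sketch using CUDA events to time just the kernel, assuming the same dimGrid/dimBlock and device pointers as in your host code; the OpenCL counterpart would be clGetEventProfilingInfo on a command queue created with CL_QUEUE_PROFILING_ENABLE:

// goes in the host code, alongside the existing kernel launch
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
matrixMultiplyShared<<<dimGrid, dimBlock>>>(d_A, d_B, d_C,
                                            numARows, numAColumns,
                                            numBRows, numBColumns,
                                            numCRows, numCColumns);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("Kernel execution time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

If the event timing disagrees with what Nsight reports, that points at a measurement problem rather than a real 6-7x difference between the two kernels.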