CUDA by example program GPU version runs slower or almost the same as CPU version

Question

I am reading the book "CUDA By Example" written by Jason Sanders and Edward Kandrot. Up to now, every program ran smoothly and correctly on my computer, until I stuck at programs of Chapter 9.

The two programs are as follows:

The first program runs only on CPU, hist_cpu.cu:

/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and 
 * proprietary rights in and to this software and related documentation. 
 * Any use, reproduction, disclosure, or distribution of this software 
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA) 
 * associated with this source code for terms and conditions that govern 
 * your use of this NVIDIA software.
 * 
 */


#include "../common/book.h"

#define SIZE    (100*1024*1024)

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    clock_t         start, stop;
    start = clock();

    unsigned int    histo[256];
    for (int i=0; i<256; i++)
        histo[i] = 0;

    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]++;

    stop = clock();
    float   elapsedTime = (float)(stop - start) /
                          (float)CLOCKS_PER_SEC * 1000.0f;
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum:  %ld\n", histoCount );

    free( buffer );
    return 0;
}

The second program runs on GPU, hist_gpu_shmem_atomics.cu:

/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and 
 * proprietary rights in and to this software and related documentation. 
 * Any use, reproduction, disclosure, or distribution of this software 
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA) 
 * associated with this source code for terms and conditions that govern 
 * your use of this NVIDIA software.
 * 
 */


#include "../common/book.h"

#define SIZE    (100*1024*1024)


__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {

    // clear out the accumulation buffer called temp
    // since we are launched with 256 threads, it is easy
    // to clear that memory with one write per thread
    __shared__  unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }
    // sync the data from the above writes to shared memory
    // then add the shared memory values to the values from
    // the other thread blocks using global memory
    // atomic adds
    // same as before, since we have 256 threads, updating the
    // global histogram is just one write per thread!
    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    // starting the timer here so that we include the cost of
    // all of the operations on the GPU.  if the data were
    // already on the GPU and we just timed the kernel
    // the timing would drop from 74 ms to 15 ms.  Very fast.
    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp  prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer,
                                    SIZE, dev_histo );

    unsigned int    histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float   elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum:  %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!\n", i );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

They should run correctly and they do run correctly on some platforms as GeForce GTX 285 (as stated in the book) and K2200 (as the programs were tested by some other user who owns it) with several hundred miliseconds with the CPU program and tens milisecond with the GPU program.

But it produced unusual result on my platforms.

1) Win7_x64 + GeForce 705M

Installed CUDA with the file "353.30-notebook-win8-win7-64bit-international-whql.exe" and ran "nvcc --version" showed

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_10_13:36:45_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0

2) Ubuntu 10.04(Lucid LTS) + GeForce GT 630

Installed CUDA with the file "cudatoolkit_3.2.16_linux_64_ubuntu10.04.run" and ran "nvcc --version" showed

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_17_18:36:13_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0

Both platform produced the result of hundreds of miliseconds' run time of CPU program and GPU program. GPU's even longer than CPU some time.

Any cause can be observed?

This is what I saw in my visual studio 2010:

CUDA C/C++ command line:

# (Approximate command-line.  Settings inherited from host are not visible below.)
# (Please see the output window after a build for the full command-line)

# Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\"
"D:\NVIDIA\CUDA\CUDAToolkit\bin\nvcc.exe" --use-local-env --cl-version 2010 -ccbin "D:\Microsoft Visual Studio 10.0\VC\bin\x86_amd64"        --keep-dir x64\Release -maxrregcount=0  --machine 64 --compile -cudart static  -o x64\Release\%(Filename)%(Extension).obj "%(FullPath)"

# Runtime API (NVCC Compilation Type is hybrid object or .c file)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\" "D:\NVIDIA\CUDA\CUDAToolkit\bin\nvcc.exe" --use-local-env --cl-version 2010 -ccbin "D:\Microsoft Visual Studio 10.0\VC\bin\x86_amd64"        --keep-dir x64\Release -maxrregcount=0  --machine 64 --compile -cudart static      -Xcompiler "/EHsc  /nologo  /Zi    " -o x64\Release\%(Filename)%(Extension).obj "%(FullPath)"

Linker command line:

/OUT:"E:\learn_cuda_by_example_exercises\Chapter9HistGpuShareMemory\x64\Release\Chapter9HistGpuShareMemory.exe" /NOLOGO 

/LIBPATH:"D:\NVIDIA\CUDA\CUDAToolkit\lib\x64" "glut64.lib" "cudart.lib" "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /MANIFEST /ManifestFile:"x64\Release\Chapter9HistGpuShareMemory.exe.intermediate.manifest" /ALLOWISOLATION 

/MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"E:\learn_cuda_by_example_exercises\Chapter9HistGpuShareMemory\x64\Release\Chapter9HistGpuShareMemory.pdb" /OPT:REF /OPT:ICF /PGD:"E:\learn_cuda_by_example_exercises\Chapter9HistGpuShareMemory\x64\Release\Chapter9HistGpuShareMemory.pgd" /LTCG /TLBID:1 

/DYNAMICBASE /NXCOMPAT /MACHINE:X64 /ERRORREPORT:QUEUE

CUDA Linker command line:

# (Approximate command-line.  Settings inherited from host are not visible below.)
# (Please see the output window after a build for the full command-line)

"D:\NVIDIA\CUDA\CUDAToolkit\bin\nvcc.exe" -dlink -o x64\Release\Chapter9HistGpuShareMemory.device-link.obj -Xcompiler "/EHsc  /nologo  /Zi    "

It looks like you have Optimus in your laptop, are you actually running the program on a dedicated GPU? — Igor Levicki
Is this the official source code from here or did you change anything? How did you exactly compile? which compiler versions are you using (e.g. gcc version)? — m.s.
Not installed Optimus on my laptop, not sure if it runs on that dedicated GPU, how figure it out or make what configuration to confirm it does? — piratesailor
On ubuntu, my gcc version is gcc (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3. — piratesailor

Jez Jez · Accepted Answer · 2015-06-29T14:25:40

I think the main issue here is the hardware.

The GeForce 705M has only 48 CUDA cores. The GeForce GT 630 has 96 or 384, depending on the version. The K2200 has 640. A top of the range card would be a Titan X, with 3072 CUDA cores.

Comparing a 705M to a K2200 it is not really surprising you see a large difference in performance.

CUDA by example program GPU version runs slower or almost the same as CPU version

1 Answers