0 votes

I have written a small CUDA program that counts how many 3's there are in a C array and prints the count.

#include <stdio.h>
#include <assert.h>
#include <cuda.h>
#include <cstdlib>

__global__ void incrementArrayOnDevice(int *a, int N, int *count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    //__shared__ int s_a[512]; // one for each thread
    //s_a[threadIdx.x] = a[id];

    if( id < N )
    {
        //if( s_a[threadIdx.x] == 3 )
        if( a[id] == 3 )
        {
            atomicAdd(count, 1);
        }
    }
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory

    int N = 16777216;

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    // copy data from host to device
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);

    // do calculation on device
    int blockSize = 512;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d\n", count);

    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}

The result I get is: real 0m3.025s user 0m2.989s sys 0m0.029s

When I run it on the CPU with 4 threads I get: real 0m0.101s user 0m0.100s sys 0m0.024s

Note that the GPU is an old one - I don't know the exact model because I don't have root access to the machine, but it reports OpenGL 1.2 with the Mesa driver.

Am I doing something wrong? What can I do to make it run faster?

Note: I have tried using a bucket for each block (so the atomicAdd()s are reduced to one per block), but I get exactly the same performance; a sketch of what I mean is just below. I have also tried copying the 512 integers assigned to each block into shared memory (you can see it in the comments), and the time is again the same.
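
The per-block bucket variant I tried was roughly along these lines (a sketch of the idea, not my exact code; the kernel name is illustrative):

__global__ void countThreesBucketed(int *a, int N, int *count)
{
    // sketch of the per-block bucket idea (illustrative, not the exact code I used)
    // one shared "bucket" per block
    __shared__ int blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < N && a[id] == 3)
        atomicAdd(&blockCount, 1);  // atomics stay within this block's shared memory

    __syncthreads();
    // a single global atomicAdd per block instead of one per matching element
    if (threadIdx.x == 0)
        atomicAdd(count, blockCount);
}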

I doubt 512 integers is enough for CUDA to outperform the CPU. – user703016
This isn't a particularly interesting test of GPU computing because there is essentially no arithmetic intensity per element (so we can't really tap into the GPU's parallel computing power) and there is essentially no data re-use, so we can't tap into the memory bandwidth advantages that GPUs frequently have. Atomic operations on GPUs typically involve a performance hit, so operations like this that produce a single (scalar) result from a data set are usually better realized as a classical parallel reduction operation, which doesn't require the use of atomics. – Robert Crovella
In addition, the huge disparity in time (3s vs. 0.1s) is not because the GPU is running this test 30 times slower. You may be using the Linux time command for timing, which measures the execution time of the entire app. There are significant start-up overheads associated with using a GPU, which can account for anywhere from a fraction of a second to several seconds depending on how the GPU is configured. If you work on a large problem (one that takes minutes or longer to process) this is not an issue, but when you work on a problem that takes 0.1s to complete, your timing is swamped by the startup overhead. – Robert Crovella
@user703016 There are 16777216 integers. – Minas Mina
@RobertCrovella Thank you, I will use the CUDA timing functions. – Minas Mina

1 Answer

0 votes

This is in response to your question "What can I do to make it run faster?" As I mentioned in the comments, there are (probably) issues with the timing methodology, and my main suggestion for a speed improvement is to use a "classical parallel reduction" algorithm. The following code implements a better (in my opinion) timing measurement, and also converts your kernel to a reduction-style kernel:

#include <stdio.h>
#include <assert.h>
#include <cstdlib>


#define N (1<<24)
#define nTPB 512
#define NBLOCKS 32

__global__ void incrementArrayOnDevice(int *a, int n, int *count)
{
  __shared__ int lcnt[nTPB];
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  int lcount = 0;
  // grid-stride loop: each thread accumulates its own count in a register
  while (id < n) {
    if (a[id] == 3) lcount++;
    id += gridDim.x * blockDim.x;
    }
  lcnt[threadIdx.x] = lcount;
  __syncthreads();
  // shared-memory tree reduction of the per-thread counts within the block
  int stride = blockDim.x;
  while(stride > 1) {
    // assume blockDim.x is a power of 2
    stride >>= 1;
    if (threadIdx.x < stride) lcnt[threadIdx.x] += lcnt[threadIdx.x + stride];
    __syncthreads();
    }
  // one global atomic per block instead of one per matching element
  if (threadIdx.x == 0) atomicAdd(count, lcnt[0]);
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory
    cudaEvent_t gstart1,gstart2,gstop1,gstop2,cstart,cstop;
    float etg1, etg2, etc;

    cudaEventCreate(&gstart1);
    cudaEventCreate(&gstart2);
    cudaEventCreate(&gstop1);
    cudaEventCreate(&gstop2);
    cudaEventCreate(&cstart);
    cudaEventCreate(&cstop);

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    int blockSize = nTPB;
    int nBlocks = NBLOCKS;
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    // copy data from host to device
    cudaEventRecord(gstart1);
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaMemset(devCount, 0, sizeof(int));
    cudaEventRecord(gstart2);
    // do calculation on device

    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);
    cudaEventRecord(gstop2);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(gstop1);

    printf("GPU count = %d\n", count);
    int hostCount = 0;
    cudaEventRecord(cstart);
    for (int i=0; i < N; i++)
      if (a_h[i] == 3) hostCount++;
    cudaEventRecord(cstop);

    printf("CPU count = %d\n", hostCount);
    cudaEventSynchronize(cstop);
    cudaEventElapsedTime(&etg1, gstart1, gstop1);
    cudaEventElapsedTime(&etg2, gstart2, gstop2);
    cudaEventElapsedTime(&etc, cstart, cstop);

    printf("GPU total time   = %fs\n", (etg1/(float)1000) );
    printf("GPU compute time = %fs\n", (etg2/(float)1000));
    printf("CPU time         = %fs\n", (etc/(float)1000));
    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}
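
If it helps with reproducing these numbers, a build command along the following lines should work (the source file name and the -arch flag are placeholders; adjust them for your GPU and CUDA version):

nvcc -O2 -arch=sm_20 count3.cu -o count3
./count3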

When I run this on a reasonably fast GPU (a Quadro 5000, a little slower than a Tesla M2050) I get the following:

number of blocks: 32
GPU count = 5592406
CPU count = 5592406
GPU total time   = 0.025714s
GPU compute time = 0.000793s
CPU time         = 0.017332s

We see that the GPU is substantially faster than this (naive, single-threaded) CPU implementation for the compute portion. When we add in the cost of transferring the data, the GPU version is slower overall, but it is not 30x slower.

By way of comparison, when I timed your original algorithm, I got numbers like this:

GPU total time   = 0.118131s
GPU compute time = 0.093213s

My system configuration for this test was a Xeon X5560 CPU, RHEL 5.5, CUDA 5.0, and a Quadro 5000 GPU.