0 votes

I have an application made up of multiple CPU threads, where each CPU thread creates a separate cudaStream in the same CUDA context on my GPU. I have a Tesla K20c, and I'm using Windows 7 64-bit and CUDA 5.5.

Here is my code:

#include "gpuCode.cuh"

__global__ void kernelAddConstant1(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
    for (int i = 0; i < 4000000; i++)
    {
        if (i%2 == 0)
        {
            g_a[idx] += 5;
        }
        else
        {
            g_a[idx] -= 5;
        }
    }
}


// a predicate that checks whether each array element is set to its index plus b
int correctResult(int *data, const int n, const int b)
{
    for (int i = 0; i < n; i++)
    {
        if (data[i] != i + b)
        {
            return 0;
        }
    }
    return 11;
}

int gpuDo()
{
    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate( &stream );

    int *a;
    int *d_a;

    unsigned int n;
    unsigned int nbytes;

    int b;

    n = 2 * 8192/16;
    nbytes = n * sizeof(int);
    b = 7;      // value by which the array is incremented

    cudaHostAlloc( (void**)&a, nbytes, cudaHostAllocDefault ) ;
    cudaMalloc((void **)&d_a, nbytes);

    for (unsigned int i = 0; i < n; i++)
        a[i] = i;

    unsigned int nbytes_per_kernel = nbytes;
    dim3 gpu_threads(128);  // 128 threads per block
    dim3 gpu_blocks(n / gpu_threads.x);

    cudaMemsetAsync(d_a, 0, nbytes_per_kernel, stream);

    cudaMemcpyAsync(d_a, a, nbytes_per_kernel, cudaMemcpyHostToDevice, stream);


    kernelAddConstant1<<<gpu_blocks, gpu_threads, 0, stream>>>(d_a, b);

    cudaMemcpyAsync(a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize ( stream ) ;
    cudaStreamDestroy(stream);

    //cudaFree(d_a);

    int bResult = correctResult(a, n, b);

    //if (a)
        //cudaFreeHost(a); // free CPU memory

    return bResult;
}

void gpuEnd()
{
    cudaDeviceReset();
}

When I leave cudaFree and cudaFreeHost commented out I achieve the following result:

[NVIDIA Visual Profiler screenshots: asynchronous timeline, top and bottom — work from the different streams overlaps]

This is perfect except that I have a memory leak because I'm not using cudaFree and cudaFreeHost. When I do use cudaFree and cudaFreeHost I get the following result:

[NVIDIA Visual Profiler screenshots: serialized timeline, top and bottom — some streams wait on others]

This is bad. When cudaFree is used, some streams wait for others to finish before starting, while others still run asynchronously. I assume this is because cudaFree is not asynchronous, which is fine, but it doesn't explain why the overlap sometimes works (as in the first three kernel launches) and sometimes doesn't. If cudaFree is called while the GPU is busy, is it possible to have the CPU continue computing and let the free happen automatically at the first opportunity? Is there another way to approach this issue? Thanks for any help you can give!


1 Answer

1 vote

Yes, cudaFree is not asynchronous. Neither is cudaMalloc.

Do all of your allocations up front before your timing critical code, and do the free operations at the end.

This should be particularly easy in your case, since the size of the allocation is the same each time.

Same comments apply to stream creation. I wouldn't bother creating and destroying them on the fly. Create however many you want, and reuse them until you're done.
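Applied to the code in the question, the restructuring might look roughly like this. This is a sketch, not drop-in code: `gpuInit`/`gpuCleanup` and the `g_*` statics are names I've made up, `kernelAddConstant1` and `correctResult` are the functions from the question, and since each CPU thread owns its own stream you'd really want one buffer set and stream per thread rather than shared statics:

```cuda
#include <cuda_runtime.h>

// Hypothetical long-lived state, allocated once and reused across calls.
static int          *g_a   = NULL;   // pinned host buffer
static int          *g_d_a = NULL;   // device buffer
static cudaStream_t  g_stream;
static unsigned int  g_n, g_nbytes;

// Call once at startup, before the timing-critical section.
void gpuInit(unsigned int n)
{
    g_n      = n;
    g_nbytes = n * sizeof(int);
    cudaSetDevice(0);
    cudaStreamCreate(&g_stream);
    cudaHostAlloc((void**)&g_a, g_nbytes, cudaHostAllocDefault);
    cudaMalloc((void**)&g_d_a, g_nbytes);
}

// Per-call work: only asynchronous operations plus the final
// synchronize, so nothing here blocks the other streams.
int gpuDo(int b)
{
    for (unsigned int i = 0; i < g_n; i++)
        g_a[i] = i;

    dim3 gpu_threads(128);               // 128 threads per block
    dim3 gpu_blocks(g_n / gpu_threads.x);

    cudaMemsetAsync(g_d_a, 0, g_nbytes, g_stream);
    cudaMemcpyAsync(g_d_a, g_a, g_nbytes, cudaMemcpyHostToDevice, g_stream);
    kernelAddConstant1<<<gpu_blocks, gpu_threads, 0, g_stream>>>(g_d_a, b);
    cudaMemcpyAsync(g_a, g_d_a, g_nbytes, cudaMemcpyDeviceToHost, g_stream);
    cudaStreamSynchronize(g_stream);

    return correctResult(g_a, g_n, b);
}

// Call once at shutdown: the blocking cudaFree/cudaFreeHost calls
// happen only here, after all timing-critical work is finished.
void gpuCleanup()
{
    cudaStreamDestroy(g_stream);
    cudaFree(g_d_a);
    cudaFreeHost(g_a);
    cudaDeviceReset();
}
```

The point is simply that the per-call path now contains no allocation or free, so nothing in it forces a device-wide synchronization.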