Cuda unified memory between gpu and host

Question

I'm writing a CUDA-based program that needs to periodically transfer a set of items from the GPU to host memory. To keep the process asynchronous, I was hoping to use CUDA's UMA to keep a memory buffer and a flag in host memory, so that both the GPU and the CPU can access them. The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU waits for the flag to be set, copies things out of the buffer, and clears the flag. As far as I can see, this doesn't produce any race condition because it forces the GPU and CPU to take turns, always reading and writing the flag opposite each other.

So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:

#include <stdio.h>

__global__
void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;

    cudaMallocHost(&h_i, sizeof(int));

    *h_i = 0;
    n = 2;

    uva_counting_test<<<1, 1>>>(n, h_i);

    //even numbers
    for(int i = 1; i <= n; ++i) {
        //wait for a change to odd from gpu
        while(*h_i == (2*(i - 1)));

        printf("host h_i: %d\n", *h_i);
        *h_i = 2*i;
    }

    return 0;
}

__global__
void uva_counting_test(int n, int *h_i) {
    //odd numbers
    for(int i = 0; i < n; ++i) {
        //wait for a change to even from host
        while(*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
    }
}

For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb. If I run it in cuda-gdb, it will hang as before. If I press ctrl+C, it will bring me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it will freeze on the while() loop in the kernel again after each kernel, but I can keep pushing it forward with ctrl+C and continue.

If there's a better way to accomplish what I'm trying to do, that would also be helpful.

Nothing in your code guarantees cache coherence. Without memory fences of some kind this approach cannot work. Consider launching a kernel each time instead, which is fairly cheap compared to unified memory accesses anyway. - gha.st
Your example code doesn't work because there is no guarantee of memory coherence across the PCIe bus during kernel execution. The basic rule of this game is: don't try to design any sort of execution model that relies on anything other than explicit, host driver level synchronization between GPU and host device. - talonmies
You are not using Unified Memory. You are using zero-copy host memory. If you just want to see a counting test that works, take a look here. In addition to all the other comments about your approach, today's implementation of unified memory is not designed to provide simultaneous coherent access to a memory region for both the host and a currently executing kernel. - Robert Crovella

Tom Tom · Accepted Answer · 2014-05-02T08:57:40

You are describing a producer-consumer model, where the GPU is producing some data and from time-to-time the CPU will consume that data.

The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU. When it is ready to consume data (i.e. at the while loop in your example), it synchronises with the GPU, copies the data back from the GPU, launches the kernel again to generate more data, and then does whatever it has to do with the data it copied. This lets the GPU fill a fixed-size buffer while the CPU processes the previous batch, since there are two copies of the buffer: one on the GPU and one on the CPU.
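
As a rough illustration of the CPU-as-master loop, here is a minimal sketch. The kernel `produce`, the function `run_cpu_master`, and the "items" (plain integers) are all made up for this example and are not from your code; error handling is deliberately thin.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer: each thread writes one item of the given batch.
__global__ void produce(int *d_buf, int count, int batch) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count) d_buf[idx] = batch * count + idx;
}

// Runs `batches` rounds with the CPU as master. Returns the sum of all
// items the CPU consumed, or -1 if a CUDA call failed.
long long run_cpu_master(int batches, int count) {
    int *d_buf = nullptr;
    int *h_buf = nullptr;
    size_t bytes = count * sizeof(int);
    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) return -1;
    if (cudaMallocHost((void **)&h_buf, bytes) != cudaSuccess) {
        cudaFree(d_buf);
        return -1;
    }

    int threads = 128;
    int blocks = (count + threads - 1) / threads;
    long long total = 0;

    if (batches > 0) produce<<<blocks, threads>>>(d_buf, count, 0);  // first batch
    for (int b = 0; b < batches; ++b) {
        // Explicit host-side synchronisation: wait until batch b is complete.
        if (cudaDeviceSynchronize() != cudaSuccess) { total = -1; break; }
        // Copy the finished batch into the CPU's own copy of the buffer.
        if (cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
            total = -1;
            break;
        }
        // Relaunch right away so the GPU produces batch b+1 ...
        if (b + 1 < batches) produce<<<blocks, threads>>>(d_buf, count, b + 1);
        // ... while the CPU consumes batch b from its own copy.
        for (int i = 0; i < count; ++i) total += h_buf[i];
    }

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return total;
}
```

Calling `run_cpu_master(2, 4)` from `main` produces the items 0 through 7 in two batches and sums them on the host. The only communication between the two sides is the kernel launch, the synchronise, and the copy.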

That can be improved upon by double-buffering the data, meaning that you can keep the GPU busy producing data 100% of the time by ping-ponging between two buffers: it fills one while the other is being copied to the CPU. That assumes the copy-back is faster than the production; if it is not, you will saturate the copy bandwidth instead, which is also good.
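
One possible way to sketch the double-buffered version is with two streams, each owning one device buffer and one pinned host buffer. Again, `produce_batch`, `enqueue_batch`, and `run_double_buffered` are hypothetical names invented for this sketch, and the allocation-failure path simply bails out.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer: each thread writes one item of the given batch.
__global__ void produce_batch(int *d_buf, int count, int batch) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count) d_buf[idx] = batch * count + idx;
}

// Queue "produce, then copy back" for one batch on one stream.
// Both operations are asynchronous, so this returns immediately.
static void enqueue_batch(int *d_buf, int *h_buf, int count, int batch,
                          cudaStream_t stream) {
    int threads = 128;
    int blocks = (count + threads - 1) / threads;
    produce_batch<<<blocks, threads, 0, stream>>>(d_buf, count, batch);
    cudaMemcpyAsync(h_buf, d_buf, count * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
}

// Returns the sum of all consumed items, or -1 if a CUDA call failed.
long long run_double_buffered(int batches, int count) {
    int *d_buf[2] = {nullptr, nullptr};
    int *h_buf[2] = {nullptr, nullptr};
    cudaStream_t stream[2];
    size_t bytes = count * sizeof(int);

    bool ok = true;
    for (int s = 0; s < 2 && ok; ++s) {
        ok = cudaMalloc((void **)&d_buf[s], bytes) == cudaSuccess &&
             // Pinned host memory lets the async copy overlap with kernels.
             cudaMallocHost((void **)&h_buf[s], bytes) == cudaSuccess &&
             cudaStreamCreate(&stream[s]) == cudaSuccess;
    }
    if (!ok) return -1;  // sketch only: gives up without cleaning up

    // Prime both slots so the GPU always has work queued.
    for (int b = 0; b < 2 && b < batches; ++b)
        enqueue_batch(d_buf[b], h_buf[b], count, b, stream[b]);

    long long total = 0;
    for (int b = 0; b < batches; ++b) {
        int s = b % 2;  // ping-pong between slot 0 and slot 1
        if (cudaStreamSynchronize(stream[s]) != cudaSuccess) { total = -1; break; }
        // Batch b is now in h_buf[s]; the other slot keeps running meanwhile.
        for (int i = 0; i < count; ++i) total += h_buf[s][i];
        // Refill this slot only after its host buffer has been consumed.
        if (b + 2 < batches)
            enqueue_batch(d_buf[s], h_buf[s], count, b + 2, stream[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
        cudaFreeHost(h_buf[s]);
    }
    return total;
}
```

In this sketch the CPU only ever waits on the stream whose batch it needs next, so the copy-back of one buffer can overlap with production into the other.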

Neither of those is what you actually described. What you asked for is to have the GPU master the data. I'd urge caution on that, since you will need to manage your buffer size carefully and think carefully about the timing and communication issues. It's certainly possible to do something like that, but before you explore that direction you should read up on memory fences, atomic operations, and volatile.
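
To give a feel for where volatile and a fence would enter the picture, here is a heavily hedged sketch of a single-item flag handshake over mapped (zero-copy) pinned memory. All names (`gpu_producer`, `run_handshake`) are invented for illustration. It assumes a platform where a running kernel and the host can observe each other's writes to mapped pinned memory; as the comments above point out, that is not something to rely on everywhere, and the host loop will spin forever if the kernel never starts.

```cuda
#include <atomic>
#include <cuda_runtime.h>

// GPU side: wait until the flag is clear, publish one item, set the flag.
__global__ void gpu_producer(volatile int *flag, volatile int *item, int n) {
    for (int i = 0; i < n; ++i) {
        while (*flag != 0) { }   // volatile forces a fresh read on every pass
        *item = 100 + i;         // hypothetical payload
        __threadfence_system();  // order the item write before the flag write
        *flag = 1;
    }
}

// Returns the sum of the n items received, or -1 if setup failed.
long long run_handshake(int n) {
    int *h_mem = nullptr;
    int *d_mem = nullptr;
    if (cudaHostAlloc((void **)&h_mem, 2 * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return -1;
    if (cudaHostGetDevicePointer((void **)&d_mem, h_mem, 0) != cudaSuccess) {
        cudaFreeHost(h_mem);
        return -1;
    }

    volatile int *flag = h_mem;      // h_mem[0] is the flag
    volatile int *item = h_mem + 1;  // h_mem[1] is the one-item "buffer"
    *flag = 0;
    *item = 0;

    gpu_producer<<<1, 1>>>(d_mem, d_mem + 1, n);
    if (cudaGetLastError() != cudaSuccess) {
        cudaFreeHost(h_mem);
        return -1;
    }

    long long total = 0;
    for (int i = 0; i < n; ++i) {
        while (*flag != 1) { }  // spin until the GPU has published an item
        std::atomic_thread_fence(std::memory_order_acquire);
        total += *item;
        std::atomic_thread_fence(std::memory_order_release);
        *flag = 0;              // hand the slot back to the GPU
    }

    cudaDeviceSynchronize();
    cudaFreeHost(h_mem);
    return total;
}
```

Compared with your example, the differences are the volatile qualifiers (so neither compiler hoists the load out of the spin loop) and the fences that order the payload write before the flag write. Even so, I would treat this as an experiment rather than a design, and prefer one of the host-driven schemes above.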
