
Consider the following problem:

I have a computing environment with a single GPU and a single CPU. On the GPU, I run a program that performs computations on an array of 1e6 floats. This computation step is repeated n times (process 1). After each computation step, I transfer the array from device memory to host memory. Once the transfer is complete, the data is analyzed by calling a serial algorithm on the CPU (process 2).

At the moment the program runs entirely serially. I would like to know how to parallelize processes 1 and 2 to reduce the overall program runtime, given that process 1 has to wait for process 2 to finish and vice versa.

I know that CUDA kernel launches are asynchronous, and I know that there are asynchronous copy operations for pinned host memory. However, in this case the CPU needs to wait for the GPU to finish before it can start working on that output. How can I pass this information along?
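
For reference, this is roughly what my serial loop looks like (a minimal sketch; compute_step, analyze_on_CPU, and n_steps are placeholders for my actual kernel, analysis routine, and iteration count):

    // Current serial version: the host blocks on the stream before the analysis starts.
    const size_t N = 1000000;                          // 1e6 floats
    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, N * sizeof(float)); // pinned host memory, needed for async copies
    cudaMalloc((void**)&d_data, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 block(256), grid((unsigned)((N + block.x - 1) / block.x));
    for (int step = 0; step < n_steps; ++step) {
        compute_step<<<grid, block, 0, stream>>>(d_data, N);   // process 1 (GPU), placeholder kernel
        cudaMemcpyAsync(h_data, d_data, N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);                 // GPU sits idle from here on...
        analyze_on_CPU(h_data, N);                     // ...while process 2 (CPU) runs serially
    }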

I tried to adapt multi-threaded CPU producer/consumer code, but it did not work. I ended up with two serialized CPU threads that manage the GPU and CPU workloads; however, here my GPU waits for the CPU to finish before proceeding...

#include <mutex>
#include <condition_variable>

#include "ProducerConsumerBuffer.hpp"

// Hands one result array at a time from the GPU-managing thread to the CPU thread.
// c_bridge receives the device-to-host copy; c_CPU is the array the CPU analysis reads.
ProducerConsumerBuffer::ProducerConsumerBuffer(int capacity_in, int n): capacity(capacity_in), count(0) {
    c_bridge = new float[n];
    c_CPU = new float[n];
}

ProducerConsumerBuffer::~ProducerConsumerBuffer(){
    delete[] c_bridge;
    delete[] c_CPU;
}

// Producer side, called by the thread that manages the GPU: block while one result is
// still pending (count is capped at 1, so the capacity member is effectively unused),
// then copy the device data into c_bridge and signal the consumer.
void ProducerConsumerBuffer::upload(device_pointers *d, params &p, streams *s){
    std::unique_lock<std::mutex> l(lock);

    not_full.wait(l, [this](){ return count != 1; });

    copy_GPU_to_CPU(d, c_bridge, p, s);
    count++;

    not_empty.notify_one();
}

// Consumer side, called by the CPU analysis thread: block until a result is available,
// then swap the buffers so the CPU can work on c_CPU while the next copy fills c_bridge.
void ProducerConsumerBuffer::fetch(){
    std::unique_lock<std::mutex> l(lock);

    not_empty.wait(l, [this](){ return count != 0; });

    std::swap(c_bridge, c_CPU);
    count--;

    not_full.notify_one();
}
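
For completeness, this is roughly how I drive the buffer from two host threads (a sketch; run_gpu_step, analyze_c_CPU, d, p, s, n_elements, and n_steps are placeholders for my actual kernel launches, analysis routine, pointers, and sizes):

    #include <thread>

    // Inside main: launch one thread per "process" and let the buffer synchronize them.
    ProducerConsumerBuffer buf(1, n_elements);

    std::thread gpu_worker([&](){
        for (int i = 0; i < n_steps; ++i) {
            run_gpu_step(d, p, s);    // placeholder for the kernel launches of one step
            buf.upload(d, p, s);      // blocks while the previous result is still unconsumed
        }
    });

    std::thread cpu_worker([&](){
        for (int i = 0; i < n_steps; ++i) {
            buf.fetch();              // blocks until a fresh result has been swapped in
            analyze_c_CPU(buf);       // placeholder: serial analysis of the c_CPU array
        }
    });

    gpu_worker.join();
    cpu_worker.join();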

I was hoping there would be a way to do this with CUDA streams, but I think they only work for device function calls. Do I need to use MPI instead, or is there another option for synchronizing processes on a heterogeneous computing platform? I have read that OpenCL supports this, since all computing devices are organized into one "context". Is it not possible to do the same with CUDA?

Since my serial CPU operation runs about 4 times longer than the GPU operation, I was planning to create 4 CPU consumers.

Any insight would be greatly appreciated!

EDIT: The CPU function contains serial code that is not parallelizable.

Just a side note: why do you need to do the "data analyzing" on the host? If you could perform it on the device, then depending on the output of this analyzing step, you could save memory bandwidth... – m.s.
To get device concurrency between the CPU and GPU, the usual idiom is to double buffer: have the CPU and GPU operate on 2 different buffers, then switch the sense of the buffers when both devices are done. The workload you're describing sounds like it would need 4 buffers and 4 CPU threads to do the CPU processing. The goal is to have each of the 2 devices (CPU and GPU) spend equal amounts of time processing, otherwise one or the other is wasting time waiting. The pageable memcpy samples here should help: github.com/ArchaeaSoftware/cudahandbook/tree/master/concurrency – ArchaeaSoftware

1 Answer


There is no way to do what you want without using multiple threads or processes, or without invasively complicating your CPU algorithm to achieve a tolerable scheduling latency. This is because you must be able to command the GPU at the right frequency and with low latency to keep it fed with its workload, while the CPU workload does not sound insignificant and has to be factored into the runtime of the loop.

Because of this, to make sure both the CPU and the GPU are continuously processing and you achieve the highest throughput and lowest latency, you must split the GPU-commanding portion and the expensive CPU computation into different threads, with some form of IPC between the two, preferably shared memory. You may be able to simplify some tasks if the dedicated CPU processing thread is driven in a similar style to CUDA, using its cudaEvent_t objects across threads, and if the GPU-commanding thread also commands the CPU thread: that is, one command thread and two processing workers (GPU, CPU).
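
As a rough sketch of that event-based handoff (not a complete program: compute_step and analyze are placeholder declarations, h_ping/h_pong are assumed to be pinned host buffers allocated with cudaMallocHost, and error checking is omitted): the commanding thread launches the kernel and the asynchronous device-to-host copy, records a cudaEvent_t on the stream, and passes the event plus the host buffer to the CPU worker, which calls cudaEventSynchronize before starting the expensive analysis.

    #include <cuda_runtime.h>
    #include <thread>
    #include <mutex>
    #include <condition_variable>
    #include <queue>

    __global__ void compute_step(float *data, size_t n);   // placeholder GPU kernel
    void analyze(const float *data, size_t n);             // placeholder: expensive serial CPU step

    struct WorkItem {
        float      *host_data;   // pinned host buffer holding one result
        cudaEvent_t done;        // recorded after the device-to-host copy on the stream
    };

    std::mutex              m;
    std::condition_variable cv_work, cv_space;
    std::queue<WorkItem>    work;   // handoff from the commanding thread to the CPU worker

    // CPU processing thread: wait for a result, make sure the copy has landed, analyze it.
    void cpu_worker(int n_items, size_t count) {
        for (int i = 0; i < n_items; ++i) {
            WorkItem item;
            {
                std::unique_lock<std::mutex> l(m);
                cv_work.wait(l, [] { return !work.empty(); });
                item = work.front();
                work.pop();
                cv_space.notify_one();         // the commanding thread may queue the next step
            }
            cudaEventSynchronize(item.done);   // block until this result is really in host memory
            cudaEventDestroy(item.done);
            analyze(item.host_data, count);
        }
    }

    // Commanding thread: drives the GPU and feeds the CPU worker, double-buffering the host side.
    void command_loop(float *d_data, float *h_ping, float *h_pong,
                      int n_items, size_t count, cudaStream_t stream) {
        for (int i = 0; i < n_items; ++i) {
            {   // back-pressure: with only two host buffers, wait until the queue is drained
                std::unique_lock<std::mutex> l(m);
                cv_space.wait(l, [] { return work.empty(); });
            }
            float *h_dst = (i % 2 == 0) ? h_ping : h_pong;
            unsigned blocks = (unsigned)((count + 255) / 256);
            compute_step<<<blocks, 256, 0, stream>>>(d_data, count);
            cudaMemcpyAsync(h_dst, d_data, count * sizeof(float),
                            cudaMemcpyDeviceToHost, stream);
            cudaEvent_t done;
            cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
            cudaEventRecord(done, stream);

            std::lock_guard<std::mutex> l(m);
            work.push({h_dst, done});
            cv_work.notify_one();
        }
    }

If the CPU analysis really takes about 4 times as long as the GPU step, the same queue can feed several cpu_worker threads, with one pinned host buffer per outstanding item instead of just the ping/pong pair.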