Consider the following problem:
You have a computing environment with a single GPU and a single CPU. On the GPU, a program performs computations on an array of 1e6 floats, and this computation step is repeated n times (process 1). After each computation step, the array is transferred from device memory to host memory. Once the transfer is complete, the data is analyzed by a serial algorithm on the CPU (process 2).
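For concreteness, the structure looks roughly like this (a sketch, not my actual code; compute_kernel, analyze_on_cpu, and the launch configuration are placeholders):

#include <cuda_runtime.h>

__global__ void compute_kernel(float *data, int n) {  // placeholder compute step
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

void analyze_on_cpu(const float *data, int n) { /* placeholder serial analysis */ }

void run_serial(int steps) {
    const int N = 1000000;                // 1e6 floats
    float *d_data;
    float *h_data = new float[N];
    cudaMalloc(&d_data, N * sizeof(float));
    for (int s = 0; s < steps; ++s) {
        compute_kernel<<<(N + 255) / 256, 256>>>(d_data, N);  // process 1 (GPU)
        cudaMemcpy(h_data, d_data, N * sizeof(float),
                   cudaMemcpyDeviceToHost);                   // blocks until done
        analyze_on_cpu(h_data, N);                            // process 2 (CPU)
    }
    cudaFree(d_data);
    delete[] h_data;
}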
This program currently runs serially. I would like to know how to parallelize processes 1 and 2 to reduce the overall program runtime. Note that process 1 must wait for process 2 to finish and vice versa.
I know that CUDA kernels are launched asynchronously, and I know that there are async copy operations with pinned host memory. However, in this case I need to wait for the GPU to finish before the CPU can start working on its output. How can I pass this information along?
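What I mean by the asynchronous pieces, continuing the placeholder names from the sketch above: the launch and the copy return immediately, but the only way I know to learn that the copy finished is to block the whole host thread:

void async_step(float *d_data, int N) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    float *h_pinned;
    cudaMallocHost(&h_pinned, N * sizeof(float));  // pinned host memory for async copy
    cudaEvent_t copy_done;
    cudaEventCreate(&copy_done);

    compute_kernel<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N);  // returns at once
    cudaMemcpyAsync(h_pinned, d_data, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);                 // also async
    cudaEventRecord(copy_done, stream);

    cudaEventSynchronize(copy_done);  // ...but here the host thread just blocks anyway
    analyze_on_cpu(h_pinned, N);

    cudaEventDestroy(copy_done);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
}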
I tried to modify multi-threaded CPU producer/consumer code, but it did not work. I ended up serializing two CPU threads that manage the GPU and CPU workloads. However, now my GPU waits for the CPU to finish before proceeding...
#include <mutex>
#include <condition_variable>
#include "ProducerConsumerBuffer.hpp"

ProducerConsumerBuffer::ProducerConsumerBuffer(int capacity_in, int n)
    : capacity(capacity_in), count(0) {
    c_bridge = new float[n];  // staging buffer the GPU result is copied into
    c_CPU    = new float[n];  // buffer the CPU analysis reads from
}

ProducerConsumerBuffer::~ProducerConsumerBuffer() {
    delete[] c_bridge;
    delete[] c_CPU;
}

// Producer side, called from the GPU-managing thread after each compute step.
// The two-buffer swap design supports only a single slot, so the capacity
// member is effectively unused and the predicate is hard-coded to 1.
void ProducerConsumerBuffer::upload(device_pointers *d, params &p, streams *s) {
    std::unique_lock<std::mutex> l(lock);
    not_full.wait(l, [this]() { return count != 1; });  // wait until the slot is free
    copy_GPU_to_CPU(d, c_bridge, p, s);  // D2H copy runs while the lock is held
    count++;
    not_empty.notify_one();
}

// Consumer side, called from the CPU-analysis thread before each analysis step.
void ProducerConsumerBuffer::fetch() {
    std::unique_lock<std::mutex> l(lock);
    not_empty.wait(l, [this]() { return count != 0; });  // wait for filled data
    std::swap(c_bridge, c_CPU);  // hand the filled buffer to the CPU side
    count--;
    not_full.notify_one();
}
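For context, my two threads drive the buffer roughly like this (a sketch; run_gpu_step and analyze_on_CPU stand in for my real functions, and I assume the consumer can see c_CPU):

#include <thread>
#include "ProducerConsumerBuffer.hpp"

void run_gpu_step(device_pointers *d, params &p, streams *s);  // placeholder
void analyze_on_CPU(float *data);                              // placeholder

void run_pipeline(ProducerConsumerBuffer &buf, device_pointers *d,
                  params &p, streams *s, int steps) {
    std::thread producer([&] {
        for (int i = 0; i < steps; ++i) {
            run_gpu_step(d, p, s);  // launch kernel and wait for it to finish
            buf.upload(d, p, s);    // blocks while the previous result is unread
        }
    });
    std::thread consumer([&] {
        for (int i = 0; i < steps; ++i) {
            buf.fetch();                // blocks until a new result is available
            analyze_on_CPU(buf.c_CPU);  // serial analysis (process 2)
        }
    });
    producer.join();
    consumer.join();
}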
I was hoping there would be a way to do this with CUDA streams, but I think they only work for device function calls. Do I need to use MPI instead, or is there another option for synchronizing processes on a heterogeneous computing platform? I have read that OpenCL supports this operation, since all computing devices are organized into one "context". Is it not possible to do the same with CUDA?
Since my serialized CPU operation runs about 4 times longer than the GPU operation, I was planning to create 4 CPU consumers.
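Roughly what I have in mind (a sketch; fetch_owned and release are hypothetical extensions of the buffer that would hand each consumer its own filled slot, since the current two-buffer swap cannot feed four threads):

#include <thread>
#include <vector>
#include "ProducerConsumerBuffer.hpp"

void run_consumers(ProducerConsumerBuffer &buf, int steps) {
    std::vector<std::thread> consumers;
    for (int t = 0; t < 4; ++t) {
        consumers.emplace_back([&buf, steps] {
            for (int i = 0; i < steps / 4; ++i) {
                float *chunk = buf.fetch_owned();  // hypothetical: returns a filled slot
                analyze_on_CPU(chunk);             // serial analysis on this chunk
                buf.release(chunk);                // hypothetical: return slot to the pool
            }
        });
    }
    for (auto &c : consumers) c.join();
}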
Any insight would be greatly appreciated!
EDIT: The CPU function contains serial code that is not parallelizable.