0 votes

Problem Statement: I have to continuously process 8-megapixel images captured from a camera. Several image-processing algorithms must be applied to each image, such as color interpolation and color transformation. These operations take a long time on the CPU, so I decided to run them on the GPU using a CUDA kernel. I have already written a working CUDA kernel for the color transformation, but I still need a further boost in performance.

There are basically two computational costs:

  1. Copying the source image from CPU to GPU and back
  2. Processing the source image on the GPU

While the image is being copied from CPU to GPU, nothing else happens; similarly, while the GPU is processing the image, nothing else happens.

MY IDEA: I want to use multi-threading to save some time: capture the next image while the previous image is being processed on the GPU. Then, when the GPU finishes processing the previous image, the next image is already waiting to be transferred from CPU to GPU.

What I need: I am completely new to the world of multi-threading. I am watching some tutorials and other material to learn more about it, so I am looking for suggestions about the proper steps and proper logic.

2
You haven't asked a question here. What exactly is it you would like to see in an answer? (And yes, I downvoted; sue me.) — talonmies
@talonmies: I politely asked for the reason for the downvote so that I can improve the post. So there is no need for the "sue me" remark. — skm
Use C++11 threads; google tutorials for either C++11 threads or Boost threads (they are nearly identical). It isn't too hard to learn. Also, if you need to do any processing on the CPU, look into using intrinsics to take advantage of SSE/AVX vector operations. — RyanP
What is the frame rate of the incoming image stream? Is your image processing faster than real time, or do you lag behind? — m.s.
@m.s.: I want to maintain 30 fps. — skm

2 Answers

8 votes

I'm not sure you really need threads for this. CUDA supports asynchronous concurrent execution between host and device, without the need for multiple CPU threads. What you're asking for is a fairly standard "pipelined" algorithm. It would look something like this:

$ cat t832.cu
#include <stdio.h>

#define IMGSZ 8000000
// for this example, NUM_FRAMES must be less than 255
#define NUM_FRAMES 128
#define nTPB 256
#define nBLK 64


unsigned char cur_frame = 0;
unsigned char validated_frame = 0;


bool validate_image(unsigned char *img) {
  validated_frame++;
  for (int i = 0; i < IMGSZ; i++) if (img[i] != validated_frame) {printf("image validation failed at %d, was: %d, should be: %d\n",i, img[i], validated_frame); return false;}
  return true;
}

void CUDART_CB my_callback(cudaStream_t stream, cudaError_t status, void* data) {
    validate_image((unsigned char *)data);
}


bool capture_image(unsigned char *img){

  for (int i = 0; i < IMGSZ; i++) img[i] = cur_frame;
  if (++cur_frame == NUM_FRAMES) {cur_frame--; return true;}
  return false;
}

__global__ void img_proc_kernel(unsigned char *img){

  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  while(idx < IMGSZ){
    img[idx]++;
    idx += gridDim.x*blockDim.x;}
}

int main(){

  // setup

  bool done = false;
  unsigned char *h_imgA, *h_imgB, *d_imgA, *d_imgB;
  size_t dsize = IMGSZ*sizeof(unsigned char);
  cudaHostAlloc(&h_imgA, dsize, cudaHostAllocDefault);
  cudaHostAlloc(&h_imgB, dsize, cudaHostAllocDefault);
  cudaMalloc(&d_imgA, dsize);
  cudaMalloc(&d_imgB, dsize);
  cudaStream_t st1, st2;
  cudaStreamCreate(&st1); cudaStreamCreate(&st2);
  unsigned char *cur = h_imgA;
  unsigned char *d_cur = d_imgA;
  unsigned char *nxt = h_imgB;
  unsigned char *d_nxt = d_imgB;
  cudaStream_t *curst = &st1;
  cudaStream_t *nxtst = &st2;


  done = capture_image(cur); // grabs a frame and puts it in cur
  // enter main loop
  while (!done){
    cudaMemcpyAsync(d_cur, cur, dsize, cudaMemcpyHostToDevice, *curst); // send frame to device
    img_proc_kernel<<<nBLK, nTPB, 0, *curst>>>(d_cur); // process frame
    cudaMemcpyAsync(cur, d_cur, dsize, cudaMemcpyDeviceToHost, *curst);
  // insert a cuda stream callback here to copy the cur frame to output
    cudaStreamAddCallback(*curst, &my_callback, (void *)cur, 0);
    cudaStreamSynchronize(*nxtst); // prevent overrun
    done = capture_image(nxt); // capture nxt image while GPU is processing cur
    unsigned char *tmp = cur;
    cur = nxt;
    nxt = tmp;   // ping - pong
    tmp = d_cur;
    d_cur = d_nxt;
    d_nxt = tmp;
    cudaStream_t *st_tmp = curst;
    curst = nxtst;
    nxtst = st_tmp;
    }
}
$ nvcc -o t832 t832.cu
$ cuda-memcheck ./t832
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

There are also many CUDA sample codes that may be helpful, such as simpleStreams, asyncAPI, and simpleCallbacks.

4 votes

Since your question is very broad, I can only offer the following advice:

1) Use CUDA streams

When using more than one CUDA stream, the CPU->GPU memory transfer, the GPU processing, and the GPU->CPU memory transfer can overlap. This way, the processing of the next image can already begin while the result of the previous one is being transferred back.

You can also decompose each frame: use n streams per frame and launch the image-processing kernels n times, each with an offset.
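A sketch of that per-frame decomposition might look like the following. Note that `my_kernel`, `nBLK`, `nTPB`, and the even chunk split are assumptions for illustration, not part of the original post, and the host buffer must be pinned (allocated with cudaHostAlloc) for the async copies to actually overlap:

```cuda
// Sketch: split one frame into NCHUNKS pieces, each on its own stream,
// so the copy of chunk i+1 can overlap the kernel working on chunk i.
#define NCHUNKS 4

void process_frame(unsigned char *h_img, unsigned char *d_img,
                   size_t img_size, cudaStream_t streams[NCHUNKS])
{
    size_t chunk = img_size / NCHUNKS; // assumes img_size divisible by NCHUNKS
    for (int i = 0; i < NCHUNKS; ++i) {
        size_t off = i * chunk;
        // copy this chunk up to the device on its own stream
        cudaMemcpyAsync(d_img + off, h_img + off, chunk,
                        cudaMemcpyHostToDevice, streams[i]);
        // my_kernel is a placeholder for the actual image-processing kernel;
        // it must only touch elements in [off, off + chunk)
        my_kernel<<<nBLK, nTPB, 0, streams[i]>>>(d_img, off, chunk);
        // copy the processed chunk back while later chunks are still in flight
        cudaMemcpyAsync(h_img + off, d_img + off, chunk,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < NCHUNKS; ++i)
        cudaStreamSynchronize(streams[i]);
}
```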

2) Apply the producer-consumer scheme

The producer thread captures frames from the camera and stores them in a thread-safe container. The consumer threads fetch frames from this source container, upload them to the GPU using their own CUDA streams, launch the kernel, and copy the results back to the host. Each consumer thread synchronizes with its stream(s) before trying to get a new image from the source container.

A simple implementation could look like this:

#include <vector>
#include <thread>
#include <memory>

struct ThreadSafeContainer{ /*...*/ };

struct Producer
{
    Producer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {

    }

    void run()
    {
        while(true)
        {
            // grab image from camera
            // store image in container
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
};

struct Consumer
{
    Consumer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {
        cudaStreamCreate(&stream);
    }
    ~Consumer()
    {
        cudaStreamDestroy(stream);
    }

    void run()
    {
        while(true)
        {
            // read next image from container

            // upload to GPU
            cudaMemcpyAsync(...,...,...,stream);
            // run kernel
            kernel<<<..., ..., ..., stream>>>(...);
            // copy results back
            cudaMemcpyAsync(...,...,...,stream);

            // wait for results 
            cudaStreamSynchronize(stream);

            // do something with the results
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
    cudaStream_t stream; // or multiple streams per consumer
};


int main()
{
    // create an instance of ThreadSafeContainer which will be shared between Producer and Consumer instances
    auto container = std::make_shared<ThreadSafeContainer>();

    // create one instance of Producer, pass the shared container as an argument to the constructor
    auto p = std::make_shared<Producer>(container);
    // create a separate thread which executes Producer::run  
    std::thread producer_thread(&Producer::run, p);

    const int consumer_count = 2;
    std::vector<std::thread> consumer_threads;
    std::vector<std::shared_ptr<Consumer>> consumers;

    // create as many consumers as specified
    for (int i=0; i<consumer_count;++i)
    {
        // create one instance of Consumer, pass the shared container as an argument to the constructor
        auto c = std::make_shared<Consumer>(container);
        consumers.push_back(c); // keep the Consumer instance alive for the lifetime of its thread
        // create a separate thread which executes Consumer::run
        consumer_threads.push_back(std::thread(&Consumer::run, c));
    }

    // wait for the threads to finish; otherwise the program would just exit here and the threads would be killed
    // in this example, the program will never exit, since the infinite loops in the run() methods never end
    producer_thread.join();
    for (auto& t : consumer_threads)
    {
        t.join();
    }

    return 0;
}
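The `ThreadSafeContainer` is left empty above; a minimal sketch using `std::mutex` and `std::condition_variable` could look like the following. A bounded FIFO is one reasonable choice here; the capacity, the frame type, and the `put`/`get` method names are assumptions for illustration:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Minimal thread-safe FIFO for frames: put() blocks producers when the
// queue is full, get() blocks consumers when it is empty.
struct ThreadSafeContainer
{
    explicit ThreadSafeContainer(size_t capacity = 8) : capacity_(capacity) {}

    void put(std::vector<unsigned char> frame)
    {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this]{ return q_.size() < capacity_; });
        q_.push(std::move(frame));
        not_empty_.notify_one();
    }

    std::vector<unsigned char> get()
    {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this]{ return !q_.empty(); });
        std::vector<unsigned char> frame = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return frame;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<std::vector<unsigned char>> q_;
    size_t capacity_;
};
```

With something like this, `Producer::run` would call `container->put(frame)` after grabbing each frame, and `Consumer::run` would start each iteration with `auto frame = container->get();` — the condition variables make both calls block rather than spin.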