0 votes

Problem Statement: I have to continuously process 8-megapixel images captured from a camera. Several image-processing algorithms must be applied to each image, such as color interpolation and color transformation. These operations take a long time on the CPU, so I decided to run them on the GPU using a CUDA kernel. I have already written a working CUDA kernel for the color transformation, but I still need a further boost in performance.

There are basically two computational costs:

  1. Copying the source image from CPU to GPU and back
  2. Processing the source image on the GPU

While the image is being copied from CPU to GPU, nothing else happens; similarly, while the GPU is processing the image, nothing else happens.

MY IDEA: I want to use multi-threading to save some time: capture the next image while the previous image is being processed on the GPU. Then, when the GPU finishes processing the previous image, the next image is already waiting to be transferred from CPU to GPU.

What I need: I am completely new to the world of multi-threading. I am watching some tutorials and other material to learn more about it, so I am looking for suggestions about the proper steps and proper logic.

2
You haven't asked a question here. What exactly is it you would like to see in an answer? (And yes, I downvoted; sue me.) — talonmies
@talonmies: I politely asked for the reason for the downvote so that I can improve the post. So there is no need for the "sue me" remark. — skm
Use C++11 threads; google tutorials for either C++11 threads or Boost threads (they are nearly identical). It isn't too hard to learn. Also, if you need to do any processing on the CPU, look into using intrinsics to take advantage of SSE/AVX vector operations. — RyanP
What is the frame rate of the incoming image stream? Is your image processing faster than real time, or do you lag behind? — m.s.
@m.s.: I want to maintain 30 fps. — skm

2 Answers

8 votes

I'm not sure you really need threads for this. CUDA supports asynchronous concurrent execution between host and device, without the need for multiple CPU threads. What you're asking for is a fairly standard "pipelined" algorithm. It would look something like this:

$ cat t832.cu
#include <stdio.h>

#define IMGSZ 8000000
// for this example, NUM_FRAMES must be less than 255
#define NUM_FRAMES 128
#define nTPB 256
#define nBLK 64


unsigned char cur_frame = 0;
unsigned char validated_frame = 0;


bool validate_image(unsigned char *img) {
  validated_frame++;
  for (int i = 0; i < IMGSZ; i++) if (img[i] != validated_frame) {printf("image validation failed at %d, was: %d, should be: %d\n",i, img[i], validated_frame); return false;}
  return true;
}

void CUDART_CB my_callback(cudaStream_t stream, cudaError_t status, void* data) {
    validate_image((unsigned char *)data);
}


bool capture_image(unsigned char *img){

  for (int i = 0; i < IMGSZ; i++) img[i] = cur_frame;
  if (++cur_frame == NUM_FRAMES) {cur_frame--; return true;}
  return false;
}

__global__ void img_proc_kernel(unsigned char *img){

  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  while(idx < IMGSZ){
    img[idx]++;
    idx += gridDim.x*blockDim.x;}
}

int main(){

  // setup

  bool done = false;
  unsigned char *h_imgA, *h_imgB, *d_imgA, *d_imgB;
  size_t dsize = IMGSZ*sizeof(unsigned char);
  cudaHostAlloc(&h_imgA, dsize, cudaHostAllocDefault);
  cudaHostAlloc(&h_imgB, dsize, cudaHostAllocDefault);
  cudaMalloc(&d_imgA, dsize);
  cudaMalloc(&d_imgB, dsize);
  cudaStream_t st1, st2;
  cudaStreamCreate(&st1); cudaStreamCreate(&st2);
  unsigned char *cur = h_imgA;
  unsigned char *d_cur = d_imgA;
  unsigned char *nxt = h_imgB;
  unsigned char *d_nxt = d_imgB;
  cudaStream_t *curst = &st1;
  cudaStream_t *nxtst = &st2;


  done = capture_image(cur); // grabs a frame and puts it in cur
  // enter main loop
  while (!done){
    cudaMemcpyAsync(d_cur, cur, dsize, cudaMemcpyHostToDevice, *curst); // send frame to device
    img_proc_kernel<<<nBLK, nTPB, 0, *curst>>>(d_cur); // process frame
    cudaMemcpyAsync(cur, d_cur, dsize, cudaMemcpyDeviceToHost, *curst);
  // insert a cuda stream callback here to copy the cur frame to output
    cudaStreamAddCallback(*curst, &my_callback, (void *)cur, 0);
    cudaStreamSynchronize(*nxtst); // prevent overrun
    done = capture_image(nxt); // capture nxt image while GPU is processing cur
    unsigned char *tmp = cur;
    cur = nxt;
    nxt = tmp;   // ping - pong
    tmp = d_cur;
    d_cur = d_nxt;
    d_nxt = tmp;
    cudaStream_t *st_tmp = curst;
    curst = nxtst;
    nxtst = st_tmp;
    }
}
$ nvcc -o t832 t832.cu
$ cuda-memcheck ./t832
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

There are also many CUDA sample codes that may be helpful, such as simpleStreams, asyncAPI, and simpleCallbacks.

4 votes

Since your question is very broad, I can only offer the following advice:

1) Use CUDA streams

When using more than one CUDA stream, the CPU->GPU memory transfer, the GPU processing, and the GPU->CPU memory transfer can overlap. This way, the processing of the next image can already begin while the result of the previous one is being transferred back.

You can also decompose each frame: use n streams per frame and launch the image-processing kernels n times, each with an offset.
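A sketch of that per-frame decomposition might look like the following. Note that `my_kernel`, `nBLK`, `nTPB`, and the even chunk split are assumptions for illustration, not part of the original post, and the host buffer must be pinned (allocated with cudaHostAlloc) for the async copies to actually overlap:

```cuda
// Sketch: split one frame into NCHUNKS pieces, each on its own stream,
// so the copy of chunk i+1 can overlap the kernel working on chunk i.
#define NCHUNKS 4

void process_frame(unsigned char *h_img, unsigned char *d_img,
                   size_t img_size, cudaStream_t streams[NCHUNKS])
{
    size_t chunk = img_size / NCHUNKS; // assumes img_size divisible by NCHUNKS
    for (int i = 0; i < NCHUNKS; ++i) {
        size_t off = i * chunk;
        // copy this chunk up to the device on its own stream
        cudaMemcpyAsync(d_img + off, h_img + off, chunk,
                        cudaMemcpyHostToDevice, streams[i]);
        // my_kernel is a placeholder for the actual image-processing kernel;
        // it must only touch elements in [off, off + chunk)
        my_kernel<<<nBLK, nTPB, 0, streams[i]>>>(d_img, off, chunk);
        // copy the processed chunk back while later chunks are still in flight
        cudaMemcpyAsync(h_img + off, d_img + off, chunk,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < NCHUNKS; ++i)
        cudaStreamSynchronize(streams[i]);
}
```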

2) Apply the producer-consumer scheme

The producer thread captures frames from the camera and stores them in a thread-safe container. The consumer threads fetch frames from this source container, upload them to the GPU using their own CUDA streams, launch the kernel, and copy the results back to the host. Each consumer thread synchronizes with its stream(s) before trying to get a new image from the source container.

A simple implementation could look like this:

#include <vector>
#include <thread>
#include <memory>

struct ThreadSafeContainer{ /*...*/ };

struct Producer
{
    Producer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {

    }

    void run()
    {
        while(true)
        {
            // grab image from camera
            // store image in container
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
};

struct Consumer
{
    Consumer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {
        cudaStreamCreate(&stream);
    }
    ~Consumer()
    {
        cudaStreamDestroy(stream);
    }

    void run()
    {
        while(true)
        {
            // read next image from container

            // upload to GPU
            cudaMemcpyAsync(...,...,...,stream);
            // run kernel
            kernel<<<..., ..., ..., stream>>>(...);
            // copy results back
            cudaMemcpyAsync(...,...,...,stream);

            // wait for results 
            cudaStreamSynchronize(stream);

            // do something with the results
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
    cudaStream_t stream; // or multiple streams per consumer
};


int main()
{
    // create an instance of ThreadSafeContainer which will be shared between Producer and Consumer instances
    auto container = std::make_shared<ThreadSafeContainer>();

    // create one instance of Producer, pass the shared container as an argument to the constructor
    auto p = std::make_shared<Producer>(container);
    // create a separate thread which executes Producer::run  
    std::thread producer_thread(&Producer::run, p);

    const int consumer_count = 2;
    std::vector<std::thread> consumer_threads;
    std::vector<std::shared_ptr<Consumer>> consumers;

    // create as many consumers as specified
    for (int i=0; i<consumer_count;++i)
    {
        // create one instance of Consumer, pass the shared container as an argument to the constructor
        auto c = std::make_shared<Consumer>(container);
        consumers.push_back(c); // keep the Consumer instance alive for the lifetime of its thread
        // create a separate thread which executes Consumer::run
        consumer_threads.push_back(std::thread(&Consumer::run, c));
    }

    // wait for the threads to finish; otherwise the program would just exit here and the threads would be killed
    // in this example, the program will never exit, since the infinite loops in the run() methods never end
    producer_thread.join();
    for (auto& t : consumer_threads)
    {
        t.join();
    }

    return 0;
}
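The `ThreadSafeContainer` is left empty above; a minimal sketch using `std::mutex` and `std::condition_variable` could look like the following. A bounded FIFO is one reasonable choice here; the capacity, the frame type, and the `put`/`get` method names are assumptions for illustration:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Minimal thread-safe FIFO for frames: put() blocks producers when the
// queue is full, get() blocks consumers when it is empty.
struct ThreadSafeContainer
{
    explicit ThreadSafeContainer(size_t capacity = 8) : capacity_(capacity) {}

    void put(std::vector<unsigned char> frame)
    {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this]{ return q_.size() < capacity_; });
        q_.push(std::move(frame));
        not_empty_.notify_one();
    }

    std::vector<unsigned char> get()
    {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this]{ return !q_.empty(); });
        std::vector<unsigned char> frame = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return frame;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<std::vector<unsigned char>> q_;
    size_t capacity_;
};
```

With something like this, `Producer::run` would call `container->put(frame)` after grabbing each frame, and `Consumer::run` would start each iteration with `auto frame = container->get();` — the condition variables make both calls block rather than spin.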