3 votes

I'm currently implementing a motion tracking algorithm on my GPU (CUDA/C++) and am seeing very strong speed-ups so far. As one might expect, however, the main bottleneck is transferring the frame (image) data from the CPU to the GPU.

As is, I'm using OpenCV to read in a test video file. OpenCV, however, returns the frames as packed bytes in the form RRGGBB RRGGBB ...; in other words, each pixel is aligned to a 24-bit boundary. This prevents me from using coalesced memory accesses, which causes severe performance penalties on the GPU. For now I'm just using some pre-generated test data that is 32-bit aligned (padded with zeros in the form RRGGBB00 RRGGBB00 ...), but I'd like to start using actual video data.
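
For concreteness, this is roughly the kind of CPU-side repacking I mean; a minimal sketch assuming OpenCV 2.4+ and its C++ API ("test.mkv" is just a placeholder filename):

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("test.mkv");   // placeholder for my test video
    cv::Mat frame, frame32;

    while (cap.read(frame)) {
        // OpenCV's default channel order is BGR; the point here is simply the
        // padding: expand each packed 24-bit pixel into a 32-bit-aligned
        // 4-channel pixel (the fourth channel is padding, not a real alpha value).
        cv::cvtColor(frame, frame32, CV_BGR2BGRA);

        // frame32.data is now 32-bit aligned per pixel and ready to upload to the GPU.
    }
    return 0;
}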

This misalignment is causing me significant performance penalties, so I have two main questions:

  1. Although I can manually pre-process the pixels of interest on the CPU and then initiate a transfer, is there any method that can quickly transfer the pixel data to the GPU already aligned to 32-bit boundaries? (I would assume this has the same performance hit as pre-processing; a pitched-copy sketch follows this list.)

  2. Is there another library I can use that reads the video in a different format? For example, I know SDL surfaces are packed to 32-bit boundaries, even without an alpha channel.
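
Regarding question 1, the pitched-copy idea mentioned above would look roughly like this: treat each 3-byte pixel as a "row" and let cudaMemcpy2D pad it into a 4-byte slot during the host-to-device copy. This is only a sketch (the buffer names and frame size are placeholders), and I haven't measured whether it actually beats a manual CPU repack:

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t numPixels = 640 * 480;                   // placeholder frame size

    // Host buffer of packed 24-bit pixels (3 bytes each), e.g. frame.data from OpenCV.
    unsigned char *h_rgb24 = (unsigned char *)malloc(numPixels * 3);

    // Device buffer with one 32-bit word per pixel.
    unsigned char *d_rgb32;
    cudaMalloc((void **)&d_rgb32, numPixels * 4);
    cudaMemset(d_rgb32, 0, numPixels * 4);                // so the pad byte is zero

    // Copy each 3-byte "row" (one pixel) into a 4-byte-pitched destination slot,
    // so every pixel lands on a 32-bit boundary on the device.
    cudaMemcpy2D(d_rgb32, 4,                              // dst, dst pitch (bytes)
                 h_rgb24, 3,                              // src, src pitch (bytes)
                 3, numPixels,                            // width (bytes), height (rows = pixels)
                 cudaMemcpyHostToDevice);

    cudaFree(d_rgb32);
    free(h_rgb24);
    return 0;
}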

The end goal of our implementation would be to interface in real time with a camera for robotic control, although for now I just want something that can efficiently decode my test video so we can exercise our feature detection and motion tracking algorithms on pre-defined test data.

What format are your video files? You could try loading the video with something else, like FFMpeg or the CUDA Decoder Library (I think OpenCV has FFMpeg bindings as of 2.4 too?). – alrikai
@alrikai Right now it's just a video compressed using x264 (so H.264 in an .MKV container). I looked into the CUDA decoder library, but I wasn't sure whether it was compatible with Linux as well, and I'd like the solution to carry over to when we use the system with an actual camera (and thus can't use the CUDA decoder anyway). As for OpenCV, I believe it already uses FFmpeg on Linux, but not on Windows (see the documentation here). – Breakthrough
@Breakthrough The CUDA decoder works on both Linux and Windows, although the parts of its API for parsing and stream inputs are Windows-only (so you'll have to write your own on Linux). It is possible to use it with a video input stream, so it should work with an actual camera. That said, it's a lot more work to set up than just using something like FFMpeg/avutils. However, it would probably be the most efficient option, since the decoded frames are already in device memory, and it should generally be faster than CPU-based decoding. – alrikai
@alrikai Good to know, thank you. I'll look into FFmpeg a bit more, but for now I think I'll stick with OpenCV and the fast 24-bit -> 32-bit CPU conversion MSalters outlined above. The hardware for the end system is a long way out, so if we end up using a camera with a compressed data stream, I'd have to agree that the CUDA video decoder would probably be best. – Breakthrough

1 Answer

2 votes

I tried writing a simple CUDA kernel that pads 24-bit values to 32 bits using shared memory. Note that this is not very tidy code (it only works with a single block and relies on int being 32 bits), so use it with care. I tried both a version with shared-memory atomics and one without, and both seem to work:

// WORK_SIZE is the number of pixels handled by the (single) block; the value
// below is an assumption, as it was not defined in the original snippet.
#define WORK_SIZE 256

__global__ void pad(unsigned int *data, unsigned int *odata) {
    __shared__ unsigned int array[WORK_SIZE];
    unsigned int v, high, low;

    // Each thread loads one 32-bit word of the packed 24-bit stream. That word
    // overlaps output pixel `index` and may spill into pixel `index + 1`.
    const int index = (threadIdx.x * sizeof(unsigned int)) / 3;

    array[threadIdx.x] = 0;
    __syncthreads();

    // Byte offset of this word within its first pixel, and the complementary shift.
    const int shl = threadIdx.x % 3;
    const int shr = 3 - shl;

    // Only ceil(WORK_SIZE * 3 / 4) threads have an input word to read.
    if (threadIdx.x
            < ((WORK_SIZE * 3) + sizeof(unsigned int) - 1) / sizeof(unsigned int)) {
        v = data[threadIdx.x];
        high = (v >> (shl * 8)) & ~0xFF;  // part belonging to output slot `index`
        low = v << (shr * 8);             // part spilling into slot `index + 1`
#if __CUDA_ARCH__ < 200
        // No shared-memory atomics: store the high parts first, sync, then add the low parts.
        array[index] = high;
    }
    __syncthreads();
    if (threadIdx.x
            < ((WORK_SIZE * 3) + sizeof(unsigned int) - 1) / sizeof(unsigned int)) {
        array[index + 1] += low;
#else
        // Fermi and later: combine both parts with shared-memory atomics.
        if (high)
            atomicOr(array + index, high);
        if (low)
            atomicOr(array + 1 + index, low);
#endif
    }
    __syncthreads();

    // Do computations! (Here we just fill the pad byte with 0xFF and write the pixel out.)
    odata[threadIdx.x] = array[threadIdx.x] + 0xFF;
}
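
For completeness, here's roughly how one might launch it; the host-side names are my own placeholders and error checking is omitted, so treat it as a sketch rather than a drop-in:

#include <cuda_runtime.h>

int main() {
    // The kernel needs one 32-bit word per output pixel and
    // ceil(WORK_SIZE * 3 / 4) words of packed 24-bit input.
    const size_t inWords = (WORK_SIZE * 3 + sizeof(unsigned int) - 1) / sizeof(unsigned int);

    unsigned int h_packed24[inWords];                 // would hold real frame data
    unsigned int *d_in, *d_out;
    cudaMalloc((void **)&d_in, inWords * sizeof(unsigned int));
    cudaMalloc((void **)&d_out, WORK_SIZE * sizeof(unsigned int));

    cudaMemcpy(d_in, h_packed24, inWords * sizeof(unsigned int), cudaMemcpyHostToDevice);

    // The kernel above only supports a single block of WORK_SIZE threads.
    pad<<<1, WORK_SIZE>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}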