I'm currently implementing a motion-tracking algorithm on my GPU (CUDA/C++) and am seeing strong speed-ups so far. As one might expect, however, the main bottleneck is transferring the frame (image) data from the CPU to the GPU.
As is, I'm using OpenCV to read in a test video file. OpenCV, however, returns the frames as packed 24-bit pixels in the form BBGGRR BBGGRR ... (its native byte order is BGR), or in other terms, each pixel is aligned to a 24-bit boundary. This prevents me from using coalesced memory accesses, which incurs a severe performance penalty on the GPU. For now I'm just using some pre-generated test data that is 32-bit aligned (zero-padded in the form BBGGRR00 BBGGRR00 ...), but I'd like to start using actual video data.
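To make the coalescing problem concrete, here's a minimal sketch of the access pattern I'm after (the kernel name and the grayscale conversion are just illustrative, not my actual tracking code): with 4-byte pixels, each thread can fetch its pixel as a single aligned 32-bit word, whereas 3-byte pixels force loads that straddle word boundaries.

```cpp
#include <cuda_runtime.h>

// Illustrative only: with 32-bit-aligned pixels, thread i reads in[i] as one
// aligned 4-byte word, so a warp's 32 loads cover a contiguous, aligned span
// and coalesce into a minimal number of memory transactions. With 24-bit
// packed pixels, each thread would instead need byte loads starting at
// address 3*i, which straddle word boundaries and break coalescing.
__global__ void toGrayscale(const uchar4* __restrict__ in,
                            unsigned char* out, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        uchar4 p = in[i];  // single coalesced 32-bit load (BGRA: x=B, y=G, z=R)
        out[i] = (unsigned char)(0.114f * p.x + 0.587f * p.y + 0.299f * p.z);
    }
}
```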
This leaves me with two main questions:
1. I know I can manually pre-process the pixels of interest on the CPU and then initiate a transfer, but is there any method which can quickly transfer the pixel data to the GPU already aligned to 32-bit boundaries? (I would assume this carries the same performance hit as the pre-processing itself; a rough sketch of the pre-processing path I mean follows the second question.)
2. Is there another library I can use to read in the video in a different format? For example, I know SDL surfaces are packed on 32-bit boundaries, even without an alpha channel included.
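For reference, this is roughly the CPU pre-processing path I mean in question 1. It's a sketch, not my actual code: the file name is a placeholder and error checking is omitted. cvtColor's BGR-to-BGRA conversion pads each pixel to 4 bytes (alpha set to 255), and the staging buffer is page-locked so the host-to-device copy can run as a DMA transfer:

```cpp
#include <opencv2/opencv.hpp>
#include <cuda_runtime.h>

int main()
{
    // "test.avi" is a placeholder for my real test file.
    cv::VideoCapture cap("test.avi");
    cv::Mat frame;
    cap >> frame;                         // packed 24-bit BGR (CV_8UC3)
    if (frame.empty()) return 1;

    size_t bytes = static_cast<size_t>(frame.rows) * frame.cols * 4;
    void*   h_pinned = nullptr;
    uchar4* d_frame  = nullptr;
    cudaMallocHost(&h_pinned, bytes);     // page-locked host staging buffer
    cudaMalloc((void**)&d_frame, bytes);

    // Repack directly into the pinned buffer: BGR -> BGRA pads each pixel
    // to 4 bytes (alpha = 255), giving the 32-bit alignment the GPU wants.
    cv::Mat bgra(frame.rows, frame.cols, CV_8UC4, h_pinned);
    cv::cvtColor(frame, bgra, cv::COLOR_BGR2BGRA);

    // Pinned memory lets this copy be DMA'd, and would also allow an async
    // copy (cudaMemcpyAsync) to overlap with kernel work later on.
    cudaMemcpy(d_frame, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_pinned);
    cudaFree(d_frame);
    return 0;
}
```

My worry is that the cvtColor pass just shifts the repacking cost onto the CPU for every frame, which is why I'm asking whether the transfer itself, or the decoder, can hand me 32-bit-aligned pixels directly.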
The end goal of our implementation is to interface in real time with a camera for robotic control, but for now I just want something that can efficiently decode my test video so we can test our feature-detection and motion-tracking algorithms against pre-defined test data.