My program analyzes a video file, which is represented as a 3D array and sent from LabVIEW to my program. LabVIEW already flattens this 3D array into a 1D array, so I have just been allocating a 1D array in CUDA with cudaMalloc and using cudaMemcpy to copy the data over. However, I noticed that if I send more than 2XXX 120x240-pixel images, I get an "unknown error" from some of my CUDA memory functions (cudaMemcpy and cudaFree, which occur later in my program, after a few kernels are called), and these ultimately break my program. If I lower the number of images I am sending, I don't have a problem. This leads me to believe that my code is fine, but my memory allocation practices are bad.

To start, let's talk about pitched memory. As far as I am aware, this is all about padding each allocated row so that it starts on an aligned address and linearly adjacent data is not split across two memory transactions. This matters especially for 2D and 3D arrays, since you want to keep rows (or columns) together in memory for fast access.
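For example, my understanding is that the pitched version of my allocation would look roughly like the sketch below (I'm assuming one float per pixel and treating 240 as the row width; my actual pixel format may differ):

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    // Assumption: 240 pixels per row, 120 rows per frame, one float per pixel.
    const size_t widthBytes = 240 * sizeof(float);
    const size_t height     = 120;

    float *dFrame = nullptr;
    size_t pitch  = 0;  // actual bytes per allocated row, >= widthBytes

    cudaError_t err = cudaMallocPitch((void **)&dFrame, &pitch, widthBytes, height);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocPitch: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("requested %zu bytes/row, pitch is %zu bytes/row\n", widthBytes, pitch);

    // A flattened, unpadded host frame, as it arrives from LabVIEW.
    std::vector<float> hFrame(120 * 240, 0.0f);

    // cudaMemcpy2D inserts the per-row padding during the copy.
    err = cudaMemcpy2D(dFrame, pitch, hFrame.data(), widthBytes,
                       widthBytes, height, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy2D: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // In a kernel, row r then starts at (float *)((char *)dFrame + r * pitch).
    cudaFree(dFrame);
    return 0;
}
```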

Could these kinds of problems occur if I don't use pitched memory? What kinds of errors can occur when not using pitched memory, especially with very large arrays? Up to this point I have ignored cudaMallocPitch and cudaMalloc3D, although I do technically have 2D and 3D arrays, which I have flattened.

Finally, how can I further debug problems with my code when cudaGetLastError only tells me "unknown error"? I am able to find which function is at fault, but when it is something like cudaFree, there is no way for me to debug this kind of thing or find out where the problem is originating.
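One pattern I have seen recommended is to wrap every runtime call and kernel launch in a checking macro so a failure is reported where it actually happens (the CHECK name here is just illustrative):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure reports the file and line.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err_));       \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Usage:
//   CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
//   myKernel<<<grid, block>>>(args);
//   CHECK(cudaGetLastError());        // catches bad launch configuration
//   CHECK(cudaDeviceSynchronize());   // surfaces faults from the kernel itself;
//                                     // without this, an out-of-bounds write in a
//                                     // kernel often shows up later as an
//                                     // "unknown error" from cudaMemcpy or cudaFree
```

Is this the right approach, given that kernel launches are asynchronous and the error may not belong to the call that reports it?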

Anyway, thanks for the help.

1 Answer

The cost of not using pitched memory is speed. If two threads access adjacent frames of your video and the frames are packed contiguously with no alignment, the end of one frame can share a memory block or cache line with the start of the next, so one thread may have to wait for the other to finish its memory operation. Probably not fatal, but definitely not optimal. There can also be read-after-write or write-after-write hazards in that shared line.

The cost of using pitched memory is a slightly larger allocation whenever your element (frame or scanline) size is not an even multiple of the preferred alignment boundary: the start of the next frame or scanline may have to be padded out by a few bytes so that it begins on an appropriate address boundary. Adding 30 bytes of padding per frame or scanline to reach that boundary, over 2000 frames, adds 60,000 bytes to your total memory allocation.

If the total data set does not fit in device memory, you will have to break it up into smaller chunks and make multiple calls to your CUDA kernel to process each chunk. If your code doesn't need random access to the entire data set all the time, switching to a streaming model could drastically reduce your overall processing time: while one chunk is being copied into device memory, the kernel can be processing another chunk, so the CUDA cores don't sit idle.
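Here is a rough sketch of that double-buffered streaming pattern with two CUDA streams. The chunk size and kernel are illustrative, and the host buffer must be pinned (cudaMallocHost or cudaHostRegister) for the async copies to actually overlap with compute:

```cpp
#include <cuda_runtime.h>

__global__ void processChunk(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // stand-in for the real per-pixel work
}

// Double-buffered streaming: while one stream's chunk is in flight,
// the other stream's chunk is being processed.
void processVideo(float *hostData, size_t totalElems) {  // hostData must be pinned
    const size_t chunkElems = 1 << 20;  // illustrative chunk size
    cudaStream_t stream[2];
    float *dBuf[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void **)&dBuf[s], chunkElems * sizeof(float));
    }

    for (size_t off = 0, i = 0; off < totalElems; off += chunkElems, ++i) {
        int s = (int)(i % 2);
        size_t n = (totalElems - off < chunkElems) ? totalElems - off : chunkElems;
        cudaMemcpyAsync(dBuf[s], hostData + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        processChunk<<<(unsigned)((n + 255) / 256), 256, 0, stream[s]>>>(dBuf[s], n);
        cudaMemcpyAsync(hostData + off, dBuf[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(dBuf[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

Note that within a single stream the copy-in, kernel, and copy-out are ordered, so reusing each device buffer on every other iteration is safe.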

If your video processing code needs to see, say, 4 consecutive frame buffers to do its work, then you can work out a buffer management system that retires the oldest frame from the queue when it is no longer needed and sets up a new frame in preparation for the next kernel call. Even better - recycle the old frame memory for the new frame, to avoid the overhead of memory allocation.
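A rough sketch of such a recycled four-frame ring follows; all names and the kernel's work are made up for illustration:

```cpp
#include <cuda_runtime.h>

// Illustrative kernel that consumes a 4-frame window.
__global__ void processWindow(const float *f0, const float *f1,
                              const float *f2, const float *f3,
                              float *out, size_t frameElems) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < frameElems)
        out[i] = 0.25f * (f0[i] + f1[i] + f2[i] + f3[i]);  // placeholder work
}

void processStream(const float *hostFrames, size_t numFrames, size_t frameElems) {
    const int WINDOW = 4;
    const size_t frameBytes = frameElems * sizeof(float);

    // Allocate the ring (plus an output frame) once, up front.
    float *dFrame[WINDOW], *dOut;
    for (int i = 0; i < WINDOW; ++i) cudaMalloc((void **)&dFrame[i], frameBytes);
    cudaMalloc((void **)&dOut, frameBytes);

    for (size_t f = 0; f < numFrames; ++f) {
        int slot = (int)(f % WINDOW);  // the oldest frame's slot is reused
        cudaMemcpy(dFrame[slot], hostFrames + f * frameElems, frameBytes,
                   cudaMemcpyHostToDevice);
        if (f >= WINDOW - 1) {
            // Kernel sees the current frame and the three before it.
            unsigned blocks = (unsigned)((frameElems + 255) / 256);
            processWindow<<<blocks, 256>>>(dFrame[(f - 3) % WINDOW],
                                           dFrame[(f - 2) % WINDOW],
                                           dFrame[(f - 1) % WINDOW],
                                           dFrame[f % WINDOW],
                                           dOut, frameElems);
        }
    }
    cudaDeviceSynchronize();

    cudaFree(dOut);
    for (int i = 0; i < WINDOW; ++i) cudaFree(dFrame[i]);
}
```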

Load only what you need, when (or just before) you actually need it. This is how $20 video player and recorder chips handle multi-gigabyte video streams in real time with a pittance of actual RAM.