CUDA OPENGL Interoperability: slow mapping

Question

My application is going to take the rendered results from openGL (both depth map and the rendered 2D image information) to CUDA for processing.

One way I did is to retrieve image/depth map by glReadPixel(..., image_array_HOST/depth_array_Host)*, and then pass image_HOST/depth_HOST to CUDA by cudaMemcpy(..., cudaMemcpyHostToDevice). I have done this part, although it sounds redundant. (from GPU>CPU>GPU). *image_array_HOST/depth_array_Host are array I define on host.

Another way is to use openGL<>cuda interpol. First step is to create one buffer in openGL, and then pass image/depth information to that pixel buffer. Also one cuda token is registered and linked to that buffer. And then link the matrix on CUDA to that cuda token. (as far as I know, seems there is no a direct way to link pixel buffer to cuda matrix, there should be a cudatoken for openGL to recognize. Please, correct me if I ma wrong.)

I have also done this part. It thought it should be fairly efficicent becasue the data CUDA is processing was not transferred to anywhere, but just at where it is located on openGL. It is a data processing inside the device(GPU).

However, the spent time I got from the 2nd method is even (slightly) longerr than the first one (GPU>CPU>GPU). That really confuses me.

I am not sure if I missed any part, or maybe I didn't do it in an efficient way.

One thing I am also not sure is glReadPixel(...,*data). In my understanding, if *data is a pointer linking to memory on HOST, then it will do the data transferring from GPU>CPU. If *data=0, and one buffer is bind, then the data will be transferred to that buffer, and it should be a GPU>GPU thing.

Maybe some other method can pass the data more efficiently then glReadPixel(..,0).

Hope some people can explain my question.

Following is my code:

--

// openGL has finished its rendering, and the data are all save in the openGL. It is ready to go.
... 


// declare one pointer and memory location on cuda for later use.
float *depth_map_Device;
cudaMalloc((void**) &depth_map_Device, sizeof(float) * size); 


// inititate cuda<>openGL
cudaGLSetGLDevice(0);   


// generate a buffer, and link the cuda token to it -- buffer <>cuda token
GLuint gl_pbo;
cudaGraphicsResource_t cudaToken;   
size_t data_size = sizeof(float)*number_data;                               // number_data is defined beforehand
void *data = malloc(data_size);
glGenBuffers(1, &gl_pbo);
glBindBuffer(GL_ARRAY_BUFFER, gl_pbo);
glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW); 
glBindBuffer(GL_ARRAY_BUFFER, 0);
cudaGraphicsGLRegisterBuffer(&cudaToken, gl_pbo, cudaGraphicsMapFlagsNone); // now there is a link between gl_buffer and cudaResource
free(data);

// now it start to map(link) the data on buffer to cuda 
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo);                     
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, 0);         
// map the rendered data to buffer, since it is glReadPixels(..,0), it should be still fast? (GPU>GPU)
// width & height are defined beforehand. It can be GL_DEPTH_COMPONENT or others as well, just an example here.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gl_pbo);                       
cudaGraphicsMapResources(1, &cudaToken, 0);                     // let cufaResource which has a link to gl_buffer to the the current CUDA windows
cudaGraphicsResourceGetMappedPointer((void **)&depth_map_Device,  &data_size, cudaToken);   // transfer data
cudaGraphicsUnmapResources(1, &cudaToken, 0);           // unmap it, for the next round

// CUDA kernel
my_kernel       <<<block_number, thread_number>>> (...,depth_map_Device,...);

Since no one has replied , I comments here again, and hope some people might see it. Many thanks — user2274367

user2274367 user2274367 · Accepted Answer · 2013-04-25T15:06:38

I think I can answer my question partly now, and hope it is useful for some people.

I was binding pbo to a float cuda (GPU) memory, but seems the openGL raw image rendered data is unsigned char format, (following is my supposition) so this data need to be transformed to float and then pass to cuda memory. I think what openGL did is using CPU to do this format transformation, and that is why there is no big difference between with and without using pbo.

By using unsigned char (glreadpixel(..,GL_UNSIGNED_BYTE,0)), binding with pbo is quicker than without using pbo for reading RGB data. And then I pass it do a simple cuda kernel to do the format transformation, which is more efficient than what openGL did. By doing this the speed is much quicker.

However, it doesnt work for depth buffer. For some reason, reading depth map by glreadpixel (no matter with/without pbo) is slow. And then, I found two old discussions: http://www.opengl.org/discussion_boards/showthread.php/153121-Reading-the-Depth-Buffer-Why-so-slow

http://www.opengl.org/discussion_boards/showthread.php/173205-Saving-Restoring-Depth-Buffer-to-from-PBO

They pointed out the format question, and that is exactly what I found for RGB. (unsigned char). But I have tried unsigned char/unsigned short and unsigned int, and float for reading depth buffer, all performance almost the same speed.

So I still have speed problem for reading depth.

CUDA OPENGL Interoperability: slow mapping

1 Answers