My application is going to take the rendered results from openGL (both depth map and the rendered 2D image information) to CUDA for processing.
One way I did is to retrieve image/depth map by glReadPixel(..., image_array_HOST/depth_array_Host)*, and then pass image_HOST/depth_HOST to CUDA by cudaMemcpy(..., cudaMemcpyHostToDevice). I have done this part, although it sounds redundant. (from GPU>CPU>GPU). *image_array_HOST/depth_array_Host are array I define on host.
Another way is to use openGL<>cuda interpol. First step is to create one buffer in openGL, and then pass image/depth information to that pixel buffer. Also one cuda token is registered and linked to that buffer. And then link the matrix on CUDA to that cuda token. (as far as I know, seems there is no a direct way to link pixel buffer to cuda matrix, there should be a cudatoken for openGL to recognize. Please, correct me if I ma wrong.)
I have also done this part. It thought it should be fairly efficicent becasue the data CUDA is processing was not transferred to anywhere, but just at where it is located on openGL. It is a data processing inside the device(GPU).
However, the spent time I got from the 2nd method is even (slightly) longerr than the first one (GPU>CPU>GPU). That really confuses me.
I am not sure if I missed any part, or maybe I didn't do it in an efficient way.
One thing I am also not sure is glReadPixel(...,*data). In my understanding, if *data is a pointer linking to memory on HOST, then it will do the data transferring from GPU>CPU. If *data=0, and one buffer is bind, then the data will be transferred to that buffer, and it should be a GPU>GPU thing.
Maybe some other method can pass the data more efficiently then glReadPixel(..,0).
Hope some people can explain my question.
Following is my code:
--
// openGL has finished its rendering, and the data are all save in the openGL. It is ready to go.
...
// declare one pointer and memory location on cuda for later use.
float *depth_map_Device;
cudaMalloc((void**) &depth_map_Device, sizeof(float) * size);
// inititate cuda<>openGL
cudaGLSetGLDevice(0);
// generate a buffer, and link the cuda token to it -- buffer <>cuda token
GLuint gl_pbo;
cudaGraphicsResource_t cudaToken;
size_t data_size = sizeof(float)*number_data; // number_data is defined beforehand
void *data = malloc(data_size);
glGenBuffers(1, &gl_pbo);
glBindBuffer(GL_ARRAY_BUFFER, gl_pbo);
glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
cudaGraphicsGLRegisterBuffer(&cudaToken, gl_pbo, cudaGraphicsMapFlagsNone); // now there is a link between gl_buffer and cudaResource
free(data);
// now it start to map(link) the data on buffer to cuda
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo);
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, 0);
// map the rendered data to buffer, since it is glReadPixels(..,0), it should be still fast? (GPU>GPU)
// width & height are defined beforehand. It can be GL_DEPTH_COMPONENT or others as well, just an example here.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gl_pbo);
cudaGraphicsMapResources(1, &cudaToken, 0); // let cufaResource which has a link to gl_buffer to the the current CUDA windows
cudaGraphicsResourceGetMappedPointer((void **)&depth_map_Device, &data_size, cudaToken); // transfer data
cudaGraphicsUnmapResources(1, &cudaToken, 0); // unmap it, for the next round
// CUDA kernel
my_kernel <<<block_number, thread_number>>> (...,depth_map_Device,...);