I have joined a project that does image processing on the CPU and is currently being extended to use the GPU as well. The hope is to rely mainly on the GPU, if that proves to be faster, and keep the CPU processing path as a fall-back. I am new to GPU programming and have a few questions; I've seen aspects of them discussed in other threads, but haven't been able to find the answers I need.
1) If we were starting from scratch, what technology would you recommend for image processing on the GPU, in order to achieve the best combination of coverage (as in support on client machines) and speed? We've gone down the OpenGL + GLSL route in order to cover as many graphics cards as possible, and I am curious whether that is the best choice. What would you say about OpenCL, for example?
Given we have already started implementing the GPU module with OpenGL and shaders, I would like to get an idea of whether we are doing it in the most efficient way.
2) We use Framebuffer Objects to read from and render to textures. In most cases the area being read and the area being written to are the same size, but the textures we read from and write to can be of arbitrary sizes. In other words, we read a sub-area of what is considered the input texture and write to a sub-area of what is considered the output texture. For that purpose the output texture is "attached" to the Framebuffer Object (with glFramebufferTexture2DEXT()), but the input one is not. This requires textures to be "attached" and "detached" as they change roles (i.e. a texture might initially be written to, but in the next pass it could be used as an input to read from).
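To make that concrete, a single pass currently looks roughly like this (a simplified sketch - the handle names, the sub-area coordinates and the drawQuad() helper are placeholders for our actual code):

// Attach the current output texture to the FBO as the colour target.
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, dstTex, 0);

// The input texture is NOT attached - it is simply bound for sampling.
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, srcTex);

// Restrict output to the sub-area of the destination texture we need.
glViewport(dstX, dstY, dstW, dstH);

// Run the filter shader; the quad's texture coordinates select the
// sub-area of the input texture to read from.
glUseProgram(filterProgram);
drawQuad(srcX, srcY, srcW, srcH);   // placeholder helper

// When the roles swap on the next pass, dstTex is detached here and
// another texture is attached in its place.
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, 0, 0);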
Would it make more sense, instead, to force the inputs and outputs to be the same size and keep them permanently attached to the FBO, in terms of using the FBO efficiently and achieving better performance, or does what we already do sound good enough?
3) The project was initially designed to render on the CPU, so care was taken to request the rendering of as few pixels as possible at a time. Whenever a mouse move happens, for example, only a very small area around the cursor is re-rendered. Or, when rendering a whole image that covers the screen, it might be chopped into strips that are rendered and displayed one after the other. Does such fragmentation make sense when rendering on the GPU? What would be the best way to determine the optimum size for a render request (i.e. an output texture), so that the GPU is fully utilised?
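For reference, rendering a screen-covering image in strips currently amounts to something like this (again a sketch; stripHeight, imgW/imgH and the drawQuad() helper are placeholders):

// Render the output in horizontal strips, displaying each one as it is done.
for (int y = 0; y < imgH; y += stripHeight) {
    int h = (y + stripHeight <= imgH) ? stripHeight : imgH - y;
    glViewport(0, y, imgW, h);     // limit the render target to this strip
    drawQuad(0, y, imgW, h);       // same placeholder helper as above
    // ... the strip is then displayed before the next one is rendered ...
}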
4) What considerations are there when profiling code that runs on the GPU, in order to compare its performance with rendering on the CPU? Does measuring how long calls take to return (calling glFinish() first to ensure the commands have completed on the GPU) sound useful, or is there anything else to keep in mind?
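To be specific, this is roughly the kind of measurement I have in mind: simple wall-clock timing around glFinish(), plus, as an alternative I have only read about, a GL_TIME_ELAPSED timer query (assuming the driver exposes ARB_timer_query). runFilterPass() and getTimeMs() are placeholders for our own code:

// Variant 1: CPU-side wall-clock timing around glFinish().
double t0 = getTimeMs();            // placeholder high-resolution timer
runFilterPass();                    // issue the GL calls for one pass
glFinish();                         // block until the GPU has finished
double totalMs = getTimeMs() - t0;  // includes driver/CPU overhead

// Variant 2: GPU-only timing via a timer query (ARB_timer_query / GL 3.3).
GLuint query;
glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
runFilterPass();
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs); // waits for the result
double gpuOnlyMs = elapsedNs / 1.0e6;
glDeleteQueries(1, &query);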
Thank you very much!
I think I need to add a couple of details to clarify my questions:
2) We aren't actually using the same texture as a rendering target and reading source at the same time. It's only when rendering has finished that an "output" texture becomes "input" - i.e. when the result of a render job needs to be read for another pass or as an input for another filter.
What I was concerned with was whether attached textures are treated differently, i.e. whether the FBO or the shader would have faster access to them than to textures that aren't attached.
My initial (though probably not totally accurate) profiling didn't show dramatic differences, so I guess we aren't committing that much of a performance crime. I'll do more tests with the timing functions you suggested - these look useful.
3) I was wondering whether chopping a picture into tiny pieces (say as small as 100 x 100 pixels for a mouse move) and requesting them to be rendered one by one would be slower or faster (or whether it wouldn't matter) on a GPU, which could potentially parallelise a lot of the work. My gut feeling is that this might be overzealous optimisation that, in the best case, won't buy us much and, in the worst, might hurt performance, so I was wondering whether there is a formal way of telling for a particular implementation. In the end, I guess we'd go with whatever seems reasonable across various graphics cards.