I have joined a project that does image processing on the CPU and is currently being extended to use the GPU as well. The hope is to rely mainly on the GPU, if that proves to be faster, and keep the CPU processing path as a fall-back. I am new to GPU programming and have a few questions; I've seen aspects of them discussed in other threads, but haven't been able to find the answers I need.
1) If we were starting from scratch, what technology would you recommend for image processing on the GPU, in order to achieve the best combination of coverage (as in support on client machines) and speed? We've gone down the OpenGL + GLSL route in order to cover as many graphics cards as possible, and I am curious whether that is the best choice. What would you say about OpenCL, for example?
Given we have already started implementing the GPU module with OpenGL and shaders, I would like to get an idea of whether we are doing it in the most efficient way.
2) We use Framebuffer Objects to read from and render to textures. In most cases the area being read and the area being written to are the same size, but the textures we read from and write to can be of arbitrary sizes. In other words, we read a sub-area of what is considered the input texture and write to a sub-area of what is considered the output texture. For that purpose the output texture is "attached" to the Framebuffer Object (with glFramebufferTexture2DEXT()), but the input one is not. This requires textures to be "attached" and "detached" as they change roles (i.e. a texture might initially be written to, but in the next pass it could be used as an input to read from).
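To make that concrete, a single pass currently looks roughly like this (a simplified sketch - the handle names, the sub-area coordinates and the drawQuad() helper are placeholders for our actual code):

// Attach the current output texture to the FBO as the colour target.
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, dstTex, 0);

// The input texture is NOT attached - it is simply bound for sampling.
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, srcTex);

// Restrict output to the sub-area of the destination texture we need.
glViewport(dstX, dstY, dstW, dstH);

// Run the filter shader; the quad's texture coordinates select the
// sub-area of the input texture to read from.
glUseProgram(filterProgram);
drawQuad(srcX, srcY, srcW, srcH);   // placeholder helper

// When the roles swap on the next pass, dstTex is detached here and
// another texture is attached in its place.
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, 0, 0);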
Would it make more sense, instead, to force the inputs and outputs to be the same size and keep them permanently attached to the FBO, in terms of using the FBO efficiently and achieving better performance, or does what we already do sound good enough?
3) The project was initially designed to render on the CPU, so care was taken to request the rendering of as few pixels as possible at a time. Whenever a mouse move happens, for example, only a very small area around the cursor is re-rendered. Or, when rendering a whole image that covers the screen, it might be chopped into strips that are rendered and displayed one after the other. Does such fragmentation make sense when rendering on the GPU? What would be the best way to determine the optimum size for a render request (i.e. an output texture), so that the GPU is fully utilised?
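For reference, rendering a screen-covering image in strips currently amounts to something like this (again a sketch; stripHeight, imgW/imgH and the drawQuad() helper are placeholders):

// Render the output in horizontal strips, displaying each one as it is done.
for (int y = 0; y < imgH; y += stripHeight) {
    int h = (y + stripHeight <= imgH) ? stripHeight : imgH - y;
    glViewport(0, y, imgW, h);     // limit the render target to this strip
    drawQuad(0, y, imgW, h);       // same placeholder helper as above
    // ... the strip is then displayed before the next one is rendered ...
}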
4) What considerations are there when profiling code that runs on the GPU, in order to compare its performance with rendering on the CPU? Does measuring how long calls take to return (calling glFinish() first to ensure the commands have completed on the GPU) sound useful, or is there anything else to keep in mind?
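To be specific, this is roughly the kind of measurement I have in mind: simple wall-clock timing around glFinish(), plus, as an alternative I have only read about, a GL_TIME_ELAPSED timer query (assuming the driver exposes ARB_timer_query). runFilterPass() and getTimeMs() are placeholders for our own code:

// Variant 1: CPU-side wall-clock timing around glFinish().
double t0 = getTimeMs();            // placeholder high-resolution timer
runFilterPass();                    // issue the GL calls for one pass
glFinish();                         // block until the GPU has finished
double totalMs = getTimeMs() - t0;  // includes driver/CPU overhead

// Variant 2: GPU-only timing via a timer query (ARB_timer_query / GL 3.3).
GLuint query;
glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
runFilterPass();
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs); // waits for the result
double gpuOnlyMs = elapsedNs / 1.0e6;
glDeleteQueries(1, &query);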
Thank you very much!
I think I need to add a couple of details to clarify my questions:
2) We aren't actually using the same texture as a rendering target and reading source at the same time. It's only when rendering has finished that an "output" texture becomes "input" - i.e. when the result of a render job needs to be read for another pass or as an input for another filter.
What I was concerned with was whether attached textures are treated differently, i.e. whether the FBO or the shader would have faster access to them than to textures that aren't attached.
My initial (though probably not totally accurate) profiling didn't show dramatic differences, so I guess we aren't committing that much of a performance crime. I'll do more tests with the timing functions you suggested - these look useful.
3) I was wondering whether chopping a picture into tiny pieces (say as small as 100 x 100 pixels for a mouse move) and requesting them to be rendered one by one would be slower or faster (or whether it wouldn't matter) on a GPU, which could potentially parallelise a lot of the work. My gut feeling is that this might be overzealous optimisation that, in the best case, won't buy us much and, in the worst, might hurt performance, so I was wondering whether there is a formal way of telling for a particular implementation. In the end, I guess we'd go with whatever seems reasonable across various graphics cards.