Consider the typical "naive" vertex shader:
#version 420 core
in vec3 aPos;
uniform mat4 uMatCam;        // camera/view transform: constant across draws
uniform mat4 uMatModelView;  // per-object model transform (the view part lives in uMatCam despite the name)
uniform mat4 uMatProj;       // perspective projection: constant across draws
void main () {
    gl_Position = uMatProj * uMatCam * uMatModelView * vec4(aPos, 1.0);
}
Of course, conventional wisdom would suggest: "three mat4 multiplications happen for every vertex, and two of those matrices (projection and camera) stay uniform even across multiple subsequent glDrawX() calls within the current shader program, so at least those two should be pre-multiplied CPU-side, possibly even all three."
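For concreteness, here's a minimal host-side sketch of what that pre-multiplication might look like. It assumes GLM for the matrix math, a loader such as glad or GLEW for the GL entry points, a program already bound via glUseProgram(), and a hypothetical uniform location locMVP; the function and variable names are mine, illustrative rather than a drop-in implementation.

#include <vector>
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>
// GL types and entry points (GLint, glUniformMatrix4fv, ...) assumed to come
// from a loader such as glad or GLEW.

void drawAll(GLint locMVP,
             const glm::mat4& matProj,
             const glm::mat4& matCam,
             const std::vector<glm::mat4>& modelMats)
{
    // Pre-multiply the two matrices that are constant across draws:
    // once per UseProgram() (or per camera change), not per vertex.
    const glm::mat4 viewProj = matProj * matCam;
    for (const glm::mat4& model : modelMats) {
        // Full MVP once per Draw() call; the GPU then performs a single
        // mat4 * vec4 per vertex instead of the chained multiplies.
        const glm::mat4 mvp = viewProj * model;
        glUniformMatrix4fv(locMVP, 1, GL_FALSE, glm::value_ptr(mvp));
        // glDrawElements(...) / glDrawArrays(...) for this object goes here.
    }
}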
I'm wondering whether modern-day GPUs have optimized this use case to the point where CPU-side pre-multiplication no longer yields a performance benefit. The purist might say "it depends on the end-user's OpenGL implementation", but for this use case we can safely assume a current-generation, OpenGL 4.2-capable nVidia or ATI driver provides that implementation.
From your experience, given that we might be "Drawing" a million or so vertices per UseProgram() pass: would pre-multiplying at least the first two (the perspective-projection and camera-transform matrices) once per UseProgram() boost performance to any significant degree? What about pre-multiplying all three once per Draw() call, as in the variant sketched below?
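For reference, the fully pre-multiplied variant of the shader above would collapse to something like this (uMatMVP is my placeholder name, not from the original code):

#version 420 core
in vec3 aPos;
uniform mat4 uMatMVP; // uMatProj * uMatCam * uMatModelView, combined CPU-side per Draw()
void main () {
    gl_Position = uMatMVP * vec4(aPos, 1.0);
}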
Of course, it all comes down to benchmarking... but I was hoping someone has fundamental, current-generation hardware-implementation insights I'm missing that would suggest either "not even worth a try, don't waste your time" or "do it by all means; your current shader without pre-multiplication is sheer insanity"... Thoughts?