I think I understand what an Orthographic projection is but in case I don't can someone define this in simple terms?
I think this is part of the problem. You understand the term projection in the mathematical sense: an idempotent mapping, as is typically used when reducing the dimensionality of your data. In a typical render pipeline, the "projection" matrix doesn't do any projection at all. Instead, the rendering API defines some conventions for a 3D view volume. In OpenGL, the viewing volume is defined as the cube -1 <= x,y,z <= 1 in normalized device coordinates (NDC). The faces of this cube form the six clip planes. Any geometry outside these will be clipped or culled, so these planes simply represent the edges of the screen (or actually, the viewport, but imagining the screen here is more intuitive).
The task of the projection matrix (in combination with the perspective divide by w) is just to transform from 3D eye space (some Cartesian coordinate system relative to the "camera", if one wants to think in those terms) to 3D normalized device space. There is no mathematical projection happening in the normal case. This also means that the projection matrix defines the position of the six clipping planes in eye space. You can basically just take the well-defined corners of the view volume in normalized device space, apply the inverse of the projection matrix (and do another perspective divide), and get back the eight corners of the viewing volume in eye space.
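To make the "apply the inverse and divide by w again" step concrete, here is a minimal sketch in plain Python. The helper names (mat_vec, perspective_divide, invert, frustum) are made up for this illustration, and the example matrix assumes the classic OpenGL glFrustum convention (camera looking down the negative z axis); any other invertible projection matrix would work the same way.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def perspective_divide(v):
    """Homogeneous -> Cartesian: divide x, y, z by w."""
    x, y, z, w = v
    return (x / w, y / w, z / w)


def invert(m):
    """Invert a 4x4 matrix by Gauss-Jordan elimination with partial pivoting."""
    n = 4
    # Augment the matrix with the identity: [m | I].
    a = [list(m[r]) + [1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def frustum(left, right, bottom, top, near, far):
    """Perspective matrix in the glFrustum convention (an assumption of this sketch)."""
    return [
        [2.0 * near / (right - left), 0.0, (right + left) / (right - left), 0.0],
        [0.0, 2.0 * near / (top - bottom), (top + bottom) / (top - bottom), 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def view_volume_corners_in_eye_space(proj):
    """Map the eight corners of the NDC cube back to eye space."""
    inv = invert(proj)
    corners = []
    for x in (-1.0, 1.0):
        for y in (-1.0, 1.0):
            for z in (-1.0, 1.0):
                corners.append(perspective_divide(mat_vec(inv, [x, y, z, 1.0])))
    return corners


proj = frustum(-1.0, 1.0, -1.0, 1.0, 1.0, 10.0)
corners = view_volume_corners_in_eye_space(proj)
for corner in corners:
    print(tuple(round(v, 6) for v in corner))
```

With these example values, the four corners with NDC z = -1 should land on the near plane (eye-space z = -1) and the four with NDC z = 1 on the far plane (eye-space z = -10), where the frustum is ten times wider.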
As a result, the projection matrix defines which part of the world is mapped to the screen, and the aspect ratio of the viewing frustum must equal the aspect ratio of the viewport used for rendering if objects are to appear undistorted.
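As a small illustration of the aspect-ratio requirement, the following sketch derives the horizontal extent of a symmetric view volume from a chosen vertical extent and the viewport size. The function name and its parameters are hypothetical, picked only for this example.

```python
def view_extent_for_viewport(half_height, viewport_width, viewport_height):
    """Return (left, right, bottom, top) of a symmetric view volume whose
    aspect ratio matches the viewport, given the desired vertical half-extent."""
    aspect = viewport_width / viewport_height
    half_width = half_height * aspect
    return (-half_width, half_width, -half_height, half_height)


left, right, bottom, top = view_extent_for_viewport(3.0, 800, 600)
print(left, right, bottom, top)
```

If the viewport is resized, recomputing the extent this way keeps (right - left) / (top - bottom) equal to width / height, so a square in eye space still covers a square block of pixels.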
For an orthographic "projection", all the projection matrix does is define some cuboid in eye space (usually an axis-aligned one, so it boils down to a scale and a translation per dimension). Typically, such an ortho transformation is defined by directly specifying the viewing volume in eye space, i.e. specifying the left, right, top, bottom, near and far values. The projection matrix now simply maps x_eye=left to -1 (the left clipping plane in NDC), x_eye=right to 1 (the right clipping plane in NDC), and so on.
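The scale-and-translate mapping can be sketched in a few lines of plain Python. This assumes the classic OpenGL glOrtho convention, where the camera looks down the negative z axis and near/far are distances along that direction; the function names ortho and eye_to_ndc are just illustrative.

```python
def ortho(left, right, bottom, top, near, far):
    """Orthographic matrix in the glOrtho convention (an assumption of this sketch).
    Each axis gets one scale factor and one translation, nothing more."""
    return [
        [2.0 / (right - left), 0.0, 0.0, -(right + left) / (right - left)],
        [0.0, 2.0 / (top - bottom), 0.0, -(top + bottom) / (top - bottom)],
        [0.0, 0.0, -2.0 / (far - near), -(far + near) / (far - near)],
        [0.0, 0.0, 0.0, 1.0],
    ]


def eye_to_ndc(proj, point):
    """Apply the matrix to an eye-space point and divide by w (w stays 1 for ortho)."""
    v = [point[0], point[1], point[2], 1.0]
    x, y, z, w = [sum(proj[r][c] * v[c] for c in range(4)) for r in range(4)]
    return (x / w, y / w, z / w)


proj = ortho(0.0, 8.0, -3.0, 3.0, 1.0, 11.0)
near_corner = eye_to_ndc(proj, (0.0, -3.0, -1.0))   # left, bottom, near
far_corner = eye_to_ndc(proj, (8.0, 3.0, -11.0))    # right, top, far
print(near_corner, far_corner)
```

The left/bottom/near corner of the cuboid should come out at roughly (-1, -1, -1) and the right/top/far corner at roughly (1, 1, 1). Note that z is mapped as well, not discarded.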
In the case of a perspective "projection", the viewing volume will be a pyramid frustum in eye space. The math for that is a bit more complicated, as we have to play with the homogeneous w component, and I don't want to go into details here, but the key point I'm trying to get across is that there still is no projection. A perspective "projection" transforms the pyramid view frustum into a cube in NDC, and it transforms everything inside of this volume with it: a cube in the view frustum will actually be deformed into a somewhat "inverted" pyramid frustum, where the parts farther away actually get smaller in NDC.
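The deformation of a cube can be checked numerically. The sketch below again assumes the glFrustum matrix layout (camera looking down negative z); the function names and the example numbers are chosen only for illustration.

```python
def frustum(left, right, bottom, top, near, far):
    """Perspective matrix in the glFrustum convention (an assumption of this sketch)."""
    return [
        [2.0 * near / (right - left), 0.0, (right + left) / (right - left), 0.0],
        [0.0, 2.0 * near / (top - bottom), (top + bottom) / (top - bottom), 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def eye_to_ndc(proj, point):
    """Multiply by the matrix, then do the perspective divide by w."""
    v = [point[0], point[1], point[2], 1.0]
    x, y, z, w = [sum(proj[r][c] * v[c] for c in range(4)) for r in range(4)]
    return (x / w, y / w, z / w)


proj = frustum(-1.0, 1.0, -1.0, 1.0, 1.0, 10.0)

# Two corners of an axis-aligned cube with edge length 2, centered at z = -3:
# one on the face nearer to the camera, one on the face farther away.
front_corner = eye_to_ndc(proj, (1.0, 1.0, -2.0))
back_corner = eye_to_ndc(proj, (1.0, 1.0, -4.0))
print(front_corner, back_corner)
```

Both corners share the same eye-space x and y, but in NDC the back corner ends up closer to the center than the front one, so the far face of the cube is smaller. Both results still carry a z value: the data stays three-dimensional.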
The only case where a real projection is happening is during rasterization, when only the x and y coordinates are considered. This is always an orthographic projection along z, and it is not done by any projection matrix.
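A rough sketch of that last step: the function below (a hypothetical helper, not an API of any library) maps NDC to a pixel position for a viewport with its origin at (0, 0) and assumes the default depth range of [0, 1]. It ignores the exact pixel-center and edge rules a real rasterizer follows.

```python
import math


def ndc_to_pixel(ndc, width, height):
    """Map an NDC point to a pixel (column, row) plus a depth value.
    Only x and y select the pixel; z is kept solely for the depth test."""
    x, y, z = ndc
    px = math.floor((x + 1.0) * 0.5 * width)
    py = math.floor((y + 1.0) * 0.5 * height)
    depth = (z + 1.0) * 0.5
    return (px, py), depth


# Two points that differ only in z fall onto the same pixel.
pixel_a, depth_a = ndc_to_pixel((0.25, -0.5, -0.9), 800, 600)
pixel_b, depth_b = ndc_to_pixel((0.25, -0.5, 0.7), 800, 600)
print(pixel_a, depth_a)
print(pixel_b, depth_b)
```

Dropping z to pick the pixel is the actual orthographic projection along z; the depth value only decides which of the two fragments wins.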