3
votes

When building a computer graphics renderer (e.g. a traditional software renderer), one often speaks in terms of model, view and projection matrices. A model is transformed to world space by the model matrix, then everything is transformed to camera space by the view matrix, and finally projected to normalized device coordinates by the projection matrix. Usually some clipping against the viewing frustum then occurs, and triangles are split if necessary. Afterwards all visible triangles are projected to screen space and drawn (see e.g. Shirley, "Fundamentals of Computer Graphics").
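For reference, the vertex stage of the pipeline I have in mind looks roughly like this (a minimal sketch in Python/NumPy rather than my actual renderer code; clipping is omitted, and the names and the y-flip in the viewport step are my own assumptions):

```python
import numpy as np

def project_vertex(v_model, M_model, M_view, M_proj, width, height):
    """Transform one model-space vertex to pixel coordinates (clipping omitted)."""
    v = np.append(v_model, 1.0)                 # homogeneous coordinates
    v_clip = M_proj @ M_view @ M_model @ v      # model -> world -> camera -> clip
    ndc = v_clip[:3] / v_clip[3]                # perspective divide -> NDC in [-1, 1]
    x = (ndc[0] * 0.5 + 0.5) * width            # viewport transform to pixels
    y = (1.0 - (ndc[1] * 0.5 + 0.5)) * height   # flip y: image origin is top-left
    return x, y, ndc[2]                         # keep NDC depth for z-buffering
```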

On the other hand, when talking about camera pose estimation from point correspondences in computer vision, i.e. the Perspective-n-Point problem (PnP, e.g. http://cvlab.epfl.ch/software/EPnP), these algorithms directly estimate a camera matrix that transforms an object from model space to window coordinates (e.g. 640x480).
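Here is roughly how I understand the PnP side, as a hedged sketch using the Python bindings instead of the C++ API (the intrinsics, pose and points are made up just to keep the snippet self-contained):

```python
import numpy as np
import cv2

K = np.array([[500.0,   0.0, 320.0],      # made-up intrinsics for a 640x480 image
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R_true, _ = cv2.Rodrigues(np.array([0.1, -0.2, 0.05]))    # fabricate a ground-truth pose
t_true = np.array([[0.0], [0.0], [5.0]])
object_points = np.random.rand(8, 3)                      # 3D model points
P_true = K @ np.hstack([R_true, t_true])                  # 3x4: model space -> pixels
uvw = (P_true @ np.hstack([object_points, np.ones((8, 1))]).T).T
image_points = uvw[:, :2] / uvw[:, 2:]                    # divide by w -> pixel coords

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, np.zeros(5))
R, _ = cv2.Rodrigues(rvec)                                # rotation vector -> 3x3 matrix
P = K @ np.hstack([R, tvec])                              # recovered model-to-pixel matrix
```

The resulting P maps model-space points directly to pixel coordinates, which is exactly the matrix that doesn't obviously slot into the pipeline above.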

My question is how those two are tied together. A camera matrix I get from a PnP algorithm doesn't seem to fit this rendering pipeline. How are the two linked, and how do I properly implement my software renderer so that it also works with the camera matrices that PnP algorithms produce?

One thing I could imagine (but it might be an ugly hack?) is to scale the 2D image points that I feed to the PnP algorithm from the image resolution (e.g. [0, 640]) down to [-1, 1], to "fool" the algorithm into giving me normalized device coordinates; that might yield a view matrix the renderer can use.
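Concretely, I mean something like this (just a sketch; whether the y axis needs flipping is part of what I am unsure about):

```python
import numpy as np

def pixels_to_ndc(points_px, width=640, height=480):
    """Map pixel coordinates in [0, width) x [0, height) linearly into [-1, 1]."""
    pts = np.asarray(points_px, dtype=np.float64)
    x = pts[:, 0] / width * 2.0 - 1.0
    y = 1.0 - pts[:, 1] / height * 2.0   # flip y so +y points up, as in NDC
    return np.stack([x, y], axis=1)
```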

There are a handful of blog posts/tutorials out there along the lines of "how to convert an OpenCV solvePnP camera matrix to OpenGL", but they don't really help me understand the core of the problem: how the two things are linked, and how to properly implement such a scenario when I have a camera matrix from an algorithm like cv::solvePnP (in this case from OpenCV) that transforms directly from world to image coordinates, and I want to "feed" that matrix to my software renderer, which has a different, "computer-graphics inspired" pipeline.

Or maybe my software-rendering approach described in the first paragraph is "wrong" (meaning not optimal)?


3 Answers

2
votes

Typically in computer graphics one has the following 4x4 matrix transforms: model-to-world, world-to-view, view-to-screen. The latter is sometimes called the projection matrix, and the screen coordinates are usually normalized in [-1, 1].

Typically in computer vision, a "projection matrix" is a model-to-image matrix (usually 3x4) that maps from a 3D world space to the 2D image space, where the latter is unnormalized in [0, width) x [0, height).

One thing you should notice is that if you take your computer graphics matrices and multiply them together in the correct order, you will have a 4x4 matrix that maps from model coords to normalized screen space. If you compose that with another matrix that scales and translates from normalized screen coords to image coords, you will have something that matches the computer vision projection matrix. (Ignoring the 4x4 versus 3x4 difference.)

screen-to-image * view-to-screen * world-to-view * model-to-world = model-to-image

(Where I am assuming matrices that multiply column vectors on the left, y = Ax. Sometimes in graphics the opposite convention is used, but it's best avoided. You need to check your conventions and if necessary transpose matrices to use the same convention throughout.)
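For concreteness, a minimal numeric sketch of that composition (Python/NumPy, column-vector convention; the projection is a GL-style placeholder looking down -z, not a real camera, and all values are made up):

```python
import numpy as np

model_to_world = np.eye(4)
world_to_view = np.eye(4)                         # camera at the world origin
near, far = 0.1, 100.0
f = 1.0 / np.tan(np.radians(60.0) / 2.0)          # symmetric frustum, square pixels
view_to_screen = np.array([                       # camera space -> normalized screen
    [f,   0.0, 0.0,                               0.0],
    [0.0, f,   0.0,                               0.0],
    [0.0, 0.0, (far + near) / (near - far),       2.0 * far * near / (near - far)],
    [0.0, 0.0, -1.0,                              0.0],
])
w, h = 640, 480
screen_to_image = np.array([                      # NDC [-1, 1] -> pixels, y flipped
    [w / 2.0, 0.0,      0.0, w / 2.0],
    [0.0,     -h / 2.0, 0.0, h / 2.0],
    [0.0,     0.0,      1.0, 0.0],
    [0.0,     0.0,      0.0, 1.0],
])
model_to_image = screen_to_image @ view_to_screen @ world_to_view @ model_to_world
```

Up to the extra row and the depth handling, model_to_image plays the same role as the 3x4 computer vision projection matrix.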

The computer vision way of doing things is often more mathematically convenient. But if you want to create separate matrices for a graphics pipeline, you can do the following:

Set your model-to-world transform to be identity. Extract the rigid part of the projection matrix, i.e. the camera rotation and translation, and set that as your view matrix (world-to-camera). Set the remainder of the transform to be the projection matrix, taking into account the normalized screen space.

If you need details, look at Chapter 6 of Multiple View Geometry In Computer Vision by Hartley and Zisserman, 2nd ed.
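A hedged sketch of that split using OpenCV's decomposition routine (it assumes a renderer whose camera looks down +z, i.e. the OpenCV convention, and whose viewport step maps NDC y = +1 to the top image row; a GL-style renderer with -z forward needs additional sign flips, and the near/far planes are made up):

```python
import numpy as np
import cv2

def split_projection(P, width, height, near=0.1, far=100.0):
    """Split a 3x4 model-to-image matrix P into a rigid view matrix and a
    projection matrix mapping camera space to NDC in [-1, 1]."""
    K, R, C_h = cv2.decomposeProjectionMatrix(P)[:3]
    K = K / K[2, 2]                          # scale K so its bottom-right entry is 1
    C = (C_h[:3] / C_h[3]).reshape(3)        # camera centre in world coordinates

    view = np.eye(4)                         # rigid part: world -> camera
    view[:3, :3] = R
    view[:3, 3] = -R @ C

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    a = (far + near) / (far - near)
    b = -2.0 * far * near / (far - near)
    proj = np.array([                        # intrinsics -> NDC (+z forward)
        [2 * fx / width, 0.0,               2 * cx / width - 1.0,  0.0],
        [0.0,            -2 * fy / height,  1.0 - 2 * cy / height, 0.0],
        [0.0,            0.0,               a,                     b],
        [0.0,            0.0,               1.0,                   0.0],
    ])
    return view, proj
```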

An easier alternative that works in some situations is just to leave everything but the view-to-screen matrix as identity. Then take the world-to-image matrix and multiply it by an image-to-normalized-screen transform. The resultant world-to-normalized-screen transform can be used as the projection matrix in the graphics pipeline.

i.e. view-to-screen := image-to-screen * world-to-image (which here equals model-to-image, since model-to-world is the identity)
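A sketch of that shortcut, assuming the 3x4 world-to-image matrix is scaled so that its third row gives the camera-space depth (i.e. K[2,2] == 1) and the same +z-forward, top-left-origin conventions as above; the near/far planes used for the depth row are made up:

```python
import numpy as np

def world_to_ndc_projection(P_world_to_image, width, height, near=0.1, far=100.0):
    """Fold a pixel->NDC map into a 3x4 world-to-image matrix and pad it to a
    4x4 world-to-clip matrix usable as the graphics projection matrix."""
    image_to_ndc = np.array([                # pixels -> NDC, y flipped
        [2.0 / width, 0.0,           -1.0],
        [0.0,         -2.0 / height,  1.0],
        [0.0,         0.0,            1.0],
    ])
    M = image_to_ndc @ P_world_to_image      # 3x4: world -> (x_ndc*z, y_ndc*z, z)
    a = (far + near) / (far - near)
    b = -2.0 * far * near / (far - near)
    depth_row = a * M[2] + b * np.array([0.0, 0.0, 0.0, 1.0])
    return np.vstack([M[0], M[1], depth_row, M[2]])   # 4x4 world-to-clip
```

With model-to-world and world-to-view left as identity, the returned 4x4 matrix can be plugged in as the view-to-screen (projection) matrix.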

1
votes

The view matrix should be the inverse of the camera's transform, i.e. of the camera-to-world matrix.

The intuition is something like this: if the camera moves 1 unit to the left, the view matrix needs to move the rest of the scene 1 unit to the right.
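In code, that inverse of a rigid camera-to-world transform can be written analytically (a sketch; names are illustrative):

```python
import numpy as np

def view_from_camera_pose(R_cam_to_world, cam_position):
    """Build the view (world-to-camera) matrix from the camera's pose."""
    view = np.eye(4)
    view[:3, :3] = R_cam_to_world.T                 # inverse of a rotation is its transpose
    view[:3, 3] = -R_cam_to_world.T @ cam_position  # undo the camera translation
    return view
```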

1
votes

Your "ugly hack" is actually the correct procedure.

A camera estimation procedure (e.g. PnP) is based on measurements performed on real images, produced by a real device with a well-defined sensor resolution. So it is proper to express its output in terms of pixel coordinates, since this is the space in which the model and its prediction errors are naturally expressed. Any other representation, even when bijective, would be a gimmick masking the geometrical quantities that are actually measured.

However, once you have estimated a camera pose-and-projection model (i.e. a function transforming 3D points to pixels), you are free to remap it to whatever coordinate spaces you fancy, including normalized device coordinates as you indicate.