When building a computer graphics renderer (e.g. a traditional software renderer), one usually speaks in terms of model, view and projection matrices. A model is transformed to world space by the model matrix, everything is then transformed to camera space by the view matrix, and the projection matrix maps that to normalized device coordinates. Usually clipping against the viewing frustum happens at that point, splitting triangles where necessary. Afterwards all visible triangles are mapped to screen space and drawn (see e.g. Shirley, "Fundamentals of Computer Graphics").
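To make that concrete, here is a minimal sketch of the transform chain my renderer follows for one vertex (using GLM for the math; the viewport size and the flip of the y axis are just my assumptions, not from any particular reference):

```cpp
#include <glm/glm.hpp>

// Classic pipeline for one vertex: model -> world -> camera -> clip -> NDC -> screen
glm::vec2 projectVertex(const glm::vec3& v,
                        const glm::mat4& model,
                        const glm::mat4& view,
                        const glm::mat4& projection,
                        float screenWidth, float screenHeight)
{
    glm::vec4 clip = projection * view * model * glm::vec4(v, 1.0f); // clip space
    glm::vec3 ndc  = glm::vec3(clip) / clip.w;                       // perspective divide -> [-1, 1]

    // Viewport transform: NDC -> window coordinates (e.g. 640x480),
    // flipping y because image coordinates have their origin at the top left.
    return glm::vec2((ndc.x * 0.5f + 0.5f) * screenWidth,
                     (1.0f - (ndc.y * 0.5f + 0.5f)) * screenHeight);
}
```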
On the other hand, when talking about camera pose estimation from point correspondences in computer vision, i.e. the Perspective-n-Point problem (PnP, e.g. http://cvlab.epfl.ch/software/EPnP), these algorithms directly estimate a camera matrix that transforms an object from model space to window coordinates (e.g. 640x480 pixels).
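For reference, this is roughly how I obtain the pose with OpenCV (the intrinsic values here are made-up placeholders, and the wrapper function is just for illustration):

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Estimate the camera pose from 3D<->2D correspondences.
// objectPoints are in model space, imagePoints in pixel coordinates of a 640x480 image.
void estimatePose(const std::vector<cv::Point3f>& objectPoints,
                  const std::vector<cv::Point2f>& imagePoints,
                  cv::Mat& rvec, cv::Mat& tvec)
{
    // Intrinsics: focal length and principal point in pixels (placeholder values)
    cv::Mat cameraMatrix = (cv::Mat_<double>(3, 3) <<
        500.0,   0.0, 320.0,
          0.0, 500.0, 240.0,
          0.0,   0.0,   1.0);
    cv::Mat distCoeffs = cv::Mat::zeros(4, 1, CV_64F); // assume no lens distortion

    // rvec (Rodrigues rotation vector) and tvec map model space into the OpenCV
    // camera frame; together with cameraMatrix they project all the way to pixels.
    cv::solvePnP(objectPoints, imagePoints, cameraMatrix, distCoeffs, rvec, tvec);
}
```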
My question now is how those two are tied together. A camera matrix I get from a PnP algorithm doesn't seem to fit the rendering pipeline described above. How do they relate, and how do I properly implement my software renderer so that it also works with such camera matrices from PnP algorithms?
One thing I could imagine (but it might be an ugly hack?) is to scale the 2D image points that I feed to the PnP algorithm from the image resolution (e.g. [0, 640]) to [-1, 1], to "fool" the algorithm into giving me normalized device coordinates, which might then yield a view matrix that the renderer can use. A sketch of that idea follows below.
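This is only a sketch of the idea (the helper name and the y flip are my own guesses; whether the resulting pose is actually meaningful is exactly what I'm unsure about):

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Scale pixel coordinates (e.g. [0,640]x[0,480]) to [-1,1]x[-1,1] before calling solvePnP,
// hoping the estimated pose then maps model space to something NDC-like.
std::vector<cv::Point2f> toNdc(const std::vector<cv::Point2f>& pixels,
                               float width, float height)
{
    std::vector<cv::Point2f> ndc;
    ndc.reserve(pixels.size());
    for (const auto& p : pixels)
    {
        float x = (p.x / width)  * 2.0f - 1.0f;
        float y = 1.0f - (p.y / height) * 2.0f; // flip y: image origin top-left, NDC origin bottom-left
        ndc.emplace_back(x, y);
    }
    return ndc;
}
```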
While there are a handful of blog posts/tutorials out there on "how to convert an OpenCV solvePnP camera matrix to OpenGL" and the like, they don't really help me understand the core of the problem: how the two things are linked, and how to properly implement such a scenario when I have a camera matrix from an algorithm like cv::solvePnP (in that case from OpenCV) that transforms directly from world to image coordinates, and I want to "feed" that matrix to my software renderer with its different, "computer-graphics inspired" pipeline.
Or maybe the software-rendering approach described in the first paragraph is "wrong" (meaning: not optimal)?