If your input device supports it, the simplest option is the alternative view point capability ("GetAlternativeViewPointCap"), as shown in the following C++ code. With it enabled, the depth map is automatically transformed so that it is aligned with the color image. Given the coordinates (x, y) of a pixel in the color image, you can then simply read the depth map at the same position.
// Load the OpenNI configuration and look up the image and depth nodes.
m_context.InitFromXmlFile(path, m_scriptNode);
m_context.FindExistingNode(XN_NODE_TYPE_IMAGE, m_imageGenerator);
m_context.FindExistingNode(XN_NODE_TYPE_DEPTH, m_depthGenerator);
// If the device supports it, register the depth map to the color camera's viewpoint.
if (m_depthGenerator.IsCapabilitySupported(XN_CAPABILITY_ALTERNATIVE_VIEW_POINT)) {
    m_depthGenerator.GetAlternativeViewPointCap().SetViewPoint(m_imageGenerator);
}
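Once the depth map is registered to the color image, the lookup itself is just an array access. Here is a minimal sketch of that step. The helper name depthAtColorPixel is hypothetical, and it assumes a row-major buffer of 16-bit depth values with the same resolution as the color image, where 0 means "no measurement". In OpenNI 1.x such a buffer would typically be the one returned by m_depthGenerator.GetDepthMap(), but check this against your setup.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of OpenNI): returns the depth value stored at
// color pixel (x, y), or 0 if the pixel is out of bounds or the buffer is null.
// Assumes depthMap is row-major, width * height elements, and already
// registered to the color image at the same resolution.
std::uint16_t depthAtColorPixel(const std::uint16_t* depthMap,
                                int width, int height, int x, int y) {
    if (depthMap == nullptr || x < 0 || y < 0 || x >= width || y >= height) {
        return 0;
    }
    return depthMap[static_cast<std::size_t>(y) * width + x];
}
```

If the color and depth streams run at different resolutions, you would need to scale (x, y) before the lookup.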
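If this approach is not available, you need to estimate the transformation between the two cameras yourself, that is, the intrinsic parameters of each camera and the rotation and translation between them. A book such as "Multiple View Geometry in Computer Vision" (Hartley and Zisserman) describes all the necessary background and algorithms.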