I'm arriving a couple of years late, but I ran into the exact same issue and found several people asking the same question without ever finding an answer simple enough for me to follow. So I spent days learning this stuff just so I could boil it down to the essentials and post what I found here for future readers.
I'll also give you some code samples at the end to do what you want in Python, so stick around.
Here are some screenshots of my handwritten notes, which explain the full process:
Page 1. Page 2. Page 3.
The equation I start with can be found at https://docs.opencv.org/master/d9/d0c/group__calib3d.html:

`s * [u, v, 1]^T = M * [R | T] * [X, Y, Z, 1]^T`

where M is the 3x3 camera matrix, [R | T] is the 3x4 matrix formed by the rotation and translation from world to camera coordinates, (u, v) is the pixel location, and (X, Y, Z) is the point in real world coordinates.
Once you choose an origin in the real world that is the same for both cameras, you will have two of these equations with the same X, Y, Z values.
Sorry, you already have this next part, but others might not have gotten this far:
First you need to calibrate your cameras, which will give you the camera matrix and distortion coefficients (the intrinsic properties) for each camera.
https://docs.opencv.org/master/dc/dbb/tutorial_py_calibration.html
You only need those two, and you can dump the rvecs and tvecs because these will change when you set up the camera.
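If it helps, here is a rough sketch of that calibration step, closely following the tutorial linked above. It assumes a 9x6 chessboard and a set of photos of it from one camera; the file name pattern is just a placeholder, and you would repeat this per camera. Only `mtx` and `dist` need to be kept.

```python
import glob
import cv2
import numpy as np

# 3D corner positions of a 9x6 chessboard in its own plane (z = 0)
objp = np.zeros((9 * 6, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2)

objpoints, imgpoints = [], []  # 3D points and their detected 2D pixel positions
for fname in glob.glob('camera1_calib_*.jpg'):  # placeholder file names
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, (9, 6), None)
    if found:
        objpoints.append(objp)
        imgpoints.append(corners)

# mtx is the camera matrix and dist the distortion coefficients; keep those two
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None)
```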
Once you choose your real world coordinate system, you can use cv2.solvePnP to get the rotation and translation vectors. To do this you need a set of real world points and their corresponding pixel coordinates in each camera's image. My trick was that I wrote some code that would show an image of the field and let me pass in points; I would then click on locations in the image to create the mapping. The code for this bit is a bit lengthy, so I won't share it here unless it is requested.
`cv2.solvePnP` will give you the rotation as a vector, so you need to convert it to a 3x3 rotation matrix using the following line:
`R, jac = cv2.Rodrigues(rvec)`
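Putting those two steps together for one camera, a minimal sketch might look like this (the point values below are placeholders for illustration; use your own real world points and the pixel locations you clicked, plus the `mtx` and `dist` from your calibration):

```python
import numpy as np
import cv2

# mtx and dist come from the calibration step above; placeholder values here
mtx = np.array([[1000.0,    0.0, 640.0],
                [   0.0, 1000.0, 360.0],
                [   0.0,    0.0,   1.0]])
dist = np.zeros(5)

# Real world coordinates of some reference points (in whatever units you chose)
world_points = np.array([[0.0, 0.0, 0.0],
                         [2.0, 0.0, 0.0],
                         [2.0, 1.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [1.0, 0.5, 0.5]], dtype=np.float32)
# The pixel locations of those same points in this camera's image
image_points = np.array([[322.0, 410.0],
                         [560.0, 405.0],
                         [565.0, 290.0],
                         [330.0, 285.0],
                         [440.0, 330.0]], dtype=np.float32)

ok, rvec, tvec = cv2.solvePnP(world_points, image_points, mtx, dist)
R, jac = cv2.Rodrigues(rvec)  # 3x3 rotation matrix for this camera
```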
So now back to the original question:
You have the 3x3 camera matrix for each camera. You have the 3x3 rotation matrix for each camera. You have the 3x1 translation vector for each camera. You have some (u, v)
coordinate for where the object of interest is in each camera's image. The math is explained in more detail in the handwritten notes above.
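In case the notes are hard to read, here is the step in brief: write P = M * [R | T] (a 3x4 matrix) and X_h = [X, Y, Z, 1]^T. The starting formula says s * [u, v, 1]^T = P * X_h with s = P[2,:] . X_h, so u = (P[0,:] . X_h) / (P[2,:] . X_h) and v = (P[1,:] . X_h) / (P[2,:] . X_h). Multiplying through and rearranging gives two linear equations per camera, (v * P[2,:] - P[1,:]) . X_h = 0 and (P[0,:] - u * P[2,:]) . X_h = 0. Stacking the four rows from both cameras gives a 4x4 matrix; its first three columns form A and its last column is b, so the system becomes A * [X, Y, Z]^T = -b, which is solved by least squares in the code below.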
import numpy as np

def get_xyz(camera1_coords, camera1_M, camera1_R, camera1_T, camera2_coords, camera2_M, camera2_R, camera2_T):
    # Get the two key equations from camera1
    camera1_u, camera1_v = camera1_coords
    # Put the rotation and translation side by side and then multiply with the camera matrix
    camera1_P = camera1_M.dot(np.column_stack((camera1_R, camera1_T)))
    # Get the two linearly independent equations referenced in the notes
    camera1_vect1 = camera1_v * camera1_P[2, :] - camera1_P[1, :]
    camera1_vect2 = camera1_P[0, :] - camera1_u * camera1_P[2, :]
    # Get the two key equations from camera2
    camera2_u, camera2_v = camera2_coords
    # Put the rotation and translation side by side and then multiply with the camera matrix
    camera2_P = camera2_M.dot(np.column_stack((camera2_R, camera2_T)))
    # Get the two linearly independent equations referenced in the notes
    camera2_vect1 = camera2_v * camera2_P[2, :] - camera2_P[1, :]
    camera2_vect2 = camera2_P[0, :] - camera2_u * camera2_P[2, :]
    # Stack the 4 rows to create one 4x4 matrix
    full_matrix = np.vstack((camera1_vect1, camera1_vect2, camera2_vect1, camera2_vect2))
    # The first three columns make up A and the last column is b
    A = full_matrix[:, :3]
    b = full_matrix[:, 3].reshape((4, 1))
    # Solve the overdetermined system A * [X, Y, Z]^T = -b by least squares.
    # Note that b in the wikipedia article is -b here.
    # https://en.wikipedia.org/wiki/Overdetermined_system
    soln = np.linalg.inv(A.T.dot(A)).dot(A.T).dot(-b)
    return soln
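To sanity check the function, here is a small made-up example (all numbers are placeholders, not from a real setup): two synthetic cameras project a known 3D point using the starting formula, and get_xyz recovers that point from the two pixel coordinates.

```python
import numpy as np
import cv2

# Shared 3x3 camera matrix for this toy example (made-up intrinsics)
M = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

# Made-up extrinsics for two cameras viewing the scene from different angles
R1, _ = cv2.Rodrigues(np.array([0.0, 0.0, 0.0]))
T1 = np.array([[0.0], [0.0], [5.0]])
R2, _ = cv2.Rodrigues(np.array([0.0, -0.3, 0.0]))
T2 = np.array([[-1.0], [0.0], [5.0]])

# A known world point in homogeneous coordinates
point = np.array([[0.5], [0.2], [1.0], [1.0]])

def project(M, R, T, Xh):
    # Apply the starting formula and divide out the scale s to get pixels
    p = M.dot(np.column_stack((R, T))).dot(Xh)
    return (p[0, 0] / p[2, 0], p[1, 0] / p[2, 0])

uv1 = project(M, R1, T1, point)
uv2 = project(M, R2, T2, point)
print(get_xyz(uv1, M, R1, T1, uv2, M, R2, T2))  # approximately [[0.5], [0.2], [1.0]]
```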