If you're looking for a little intuition, here's an example of why one plane isn't enough. Imagine your calibration chessboard is tilting away from you at a 45° angle:
You can see that when you move up the chessboard by 1 meter in the +y direction, you also move away from the camera by 1 meter in the +z direction. This means there's no way to separate the effect of moving in the y direction vs the z direction. The y and z movement directions are effectively tied to each other, for all our training points. So, if we just look at points on this one plane, there's no way to tease apart the effects of y movement vs z movement.
For example, from this 1 plane, we can't tell the difference between these scenarios:
- The camera has perspective distortion such that things appear smaller in the image as they move in the world's +y direction.
- The camera focal length is such that things appear smaller in the image as they move in the world's +z direction.
- Any mixture of the effects in #1 and #2.
Mathematically, this ambiguity means that there are many equally possible solutions when OpenCV tries to fit a camera matrix to match the data. (Note that the 45° angle was not important. Any plane you choose will have the same problem: training examples' (x,y,z) dimensions are coupled together, so you can't separate their effects.)
One last note: if you make enough assumptions about the camera matrix (e.g. no perspective distortion, x and y scale identically, etc) then you can end up with a situation with fewer unknowns (in an extreme case, maybe you're just calculating the focal length) and in that case you could calibrate with just 1 plane.