You have some misconceptions about the process.
The method cv::estimateRigidTransform takes two sets of corresponding points as input and solves a set of equations to find the transformation matrix. The resulting transformation maps the src points onto the dst points, either exactly or as closely as possible when an exact match cannot be achieved (for example, with floating-point coordinates).
If you apply estimateRigidTransform to two images, OpenCV first finds matching pairs of points using an internal method (see the OpenCV docs).
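To make "solves a set of equations" a bit more concrete, here is a minimal sketch of a least-squares fit of a similarity transform (rotation, uniform scale, translation) to point pairs. It uses only the standard library; the names Point2, Similarity2D, fitSimilarity and applySimilarity are made up for this example, and this is not OpenCV's actual implementation, which is more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for this sketch (not OpenCV types).
struct Point2 { double x, y; };

// Represents the 2x3 matrix [a -b tx; b a ty]:
// rotation + uniform scale + translation.
struct Similarity2D { double a, b, tx, ty; };

// Least-squares fit of a similarity transform to pairs src[i] -> dst[i].
Similarity2D fitSimilarity(const std::vector<Point2>& src,
                           const std::vector<Point2>& dst) {
    assert(src.size() == dst.size() && src.size() >= 2);
    const double n = static_cast<double>(src.size());

    // Centroids of both point sets.
    double mx = 0, my = 0, mu = 0, mv = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        mx += src[i].x; my += src[i].y;
        mu += dst[i].x; mv += dst[i].y;
    }
    mx /= n; my /= n; mu /= n; mv /= n;

    // Accumulate the least-squares sums on centered coordinates.
    double sa = 0, sb = 0, den = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const double xc = src[i].x - mx, yc = src[i].y - my;
        const double uc = dst[i].x - mu, vc = dst[i].y - mv;
        sa  += xc * uc + yc * vc;
        sb  += xc * vc - yc * uc;
        den += xc * xc + yc * yc;
    }
    assert(den > 0.0);  // the src points must not all coincide

    Similarity2D t;
    t.a  = sa / den;
    t.b  = sb / den;
    t.tx = mu - (t.a * mx - t.b * my);
    t.ty = mv - (t.b * mx + t.a * my);
    return t;
}

// Maps a single point through the fitted transform.
Point2 applySimilarity(const Similarity2D& t, const Point2& p) {
    return Point2{t.a * p.x - t.b * p.y + t.tx,
                  t.b * p.x + t.a * p.y + t.ty};
}
```

With exact correspondences the fitted transform reproduces the dst points; with noisy ones it only gets as close as the model allows, which is the "exactly or closely" part.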
cv::warpAffine then transforms the src image to dst according to the given transformation matrix. But almost any transformation is a lossy operation. The transformed pixel positions generally fall between the pixels of the source grid, so the algorithm has to estimate values that aren't available. This process is called interpolation: you calculate the unknown value from the known information around it. Some info regarding image scaling can be found on Wikipedia. The same rules apply to other transformations - rotation, skew, perspective... The exception is translation by a whole number of pixels, where every pixel lands exactly on the grid again.
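As an illustration of what interpolation means here, this is a small sketch of bilinear sampling at a fractional position. The function name sampleBilinear and the row-major std::vector image layout are assumptions for the example, not how OpenCV stores or samples images internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Samples a grayscale image (row-major, width * height values) at a
// fractional position (x, y) by blending the four surrounding pixels.
// Coordinates outside the image are clamped to the border.
double sampleBilinear(const std::vector<double>& img,
                      int width, int height, double x, double y) {
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);

    x = std::max(0.0, std::min(x, static_cast<double>(width - 1)));
    y = std::max(0.0, std::min(y, static_cast<double>(height - 1)));

    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const int x1 = std::min(x0 + 1, width - 1);
    const int y1 = std::min(y0 + 1, height - 1);
    const double fx = x - x0;  // horizontal weight of the right neighbours
    const double fy = y - y0;  // vertical weight of the lower neighbours

    // Blend along x on the two rows, then blend the rows along y.
    const double top    = img[y0 * width + x0] * (1.0 - fx) + img[y0 * width + x1] * fx;
    const double bottom = img[y1 * width + x0] * (1.0 - fx) + img[y1 * width + x1] * fx;
    return top * (1.0 - fy) + bottom * fy;
}
```

At integer coordinates the function returns the stored pixel unchanged; anywhere in between, the result is an estimate that was never in the source image.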
Given your test images, I would guess that OpenCV takes the lampshade as reference. From the difference image it is clear that the lampshade is aligned best. By default, OpenCV uses linear interpolation for warping, as it is a fast method. But you can select a more advanced method (for example INTER_CUBIC or INTER_LANCZOS4 in the flags argument of warpAffine) for better results - again, consult the OpenCV docs.
Conclusion:
The result you got is pretty good, bearing in mind that it is the result of an automated process. If you want better results, you'll have to find another method for selecting corresponding points, or use a better interpolation method. Either way, after the transform, the diff will not be 0. That is virtually impossible to achieve, because a bitmap is a discrete grid of pixels, so there will always be some gaps whose values need to be estimated.
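A tiny 1D experiment shows why the diff cannot be expected to reach 0. The sketch below shifts a signal with linear interpolation and then shifts it back; the helper names sampleLinear1D, shiftLinear and sumAbsDiff are invented for this example, and a real image warp behaves the same way in two dimensions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linearly interpolated read at fractional position x (clamped to the ends).
double sampleLinear1D(const std::vector<double>& s, double x) {
    assert(!s.empty());
    const int last = static_cast<int>(s.size()) - 1;
    x = std::max(0.0, std::min(x, static_cast<double>(last)));
    const int x0 = static_cast<int>(std::floor(x));
    const int x1 = std::min(x0 + 1, last);
    const double f = x - x0;
    return s[x0] * (1.0 - f) + s[x1] * f;
}

// Shifts the signal to the right by `offset` samples (may be fractional).
std::vector<double> shiftLinear(const std::vector<double>& s, double offset) {
    std::vector<double> out(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        out[i] = sampleLinear1D(s, static_cast<double>(i) - offset);
    }
    return out;
}

// Sum of absolute differences, the 1D analogue of an image diff.
double sumAbsDiff(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += std::fabs(a[i] - b[i]);
    }
    return sum;
}
```

For the signal {0, 0, 10, 0, 0}, shifting by 1 and back by 1 restores it exactly, while shifting by 0.5 and back by 0.5 gives {0, 2.5, 5, 2.5, 0}: the peak is smeared out and the diff is no longer 0.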