
I have two images from a video, frameA and frameB. Assuming the video is panning slowly, frameA and frameB will have significant overlap. We can then create a panorama from the video footage.

I have tried using: opencv2.stitcher, SURF/ORB detectors with BF matching, and a few vanilla approaches. None of them are producing the results that I need [for some reason]. The main problem I am identifying is that SURF/ORB identifies too "small" a region of interest and matches it incorrectly.

Example: I am in a desert with 1 single cactus in my view, and I am panning across it. SURF/ORB detects regions of interest such as the EDGES of my cactus against the sky/land, and is unable to match them (not sure why) in the next frame. The features it does detect do not match up well, and when I compute a homography it matches, say, the middle of the cactus with the top of the cactus in the next frame... and the result gets warped.

Is there a way to do the following?

  1. Enforce only rotation and translation between 2 frames? Note that there is "new" information in subsequent frames, so they can never overlap 100%.
  2. Find the best rotation and translation, under the base assumption that a best match exists? (I am panning very, very slowly, and can guarantee high overlap.) A rough sketch of the kind of thing I mean is just below this list.
  3. Ignore minor fluctuations. If my feature detectors were "large" enough, they would say "cactus in frame 1" matches "cactus in frame 2", translate by X,Y and maybe rotate by Z.
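
The closest I have come to 1. and 2. is to estimate a constrained transform from the feature matches instead of a full homography. Below is only a rough, untested sketch of what I mean; I am assuming cv2.estimateAffinePartial2D (4 degrees of freedom: rotation, translation and uniform scale, so not quite "rotation + translation only") and the ORB patchSize/edgeThreshold parameters are the right knobs here, and the filenames are placeholders.

    import cv2
    import numpy as np
    
    # placeholder filenames for two consecutive frames
    frameA = cv2.imread('frameA.png', cv2.IMREAD_GRAYSCALE)
    frameB = cv2.imread('frameB.png', cv2.IMREAD_GRAYSCALE)
    
    # a "larger" ORB: patchSize/edgeThreshold control how big each described region is
    orb = cv2.ORB_create(nfeatures=1000, patchSize=63, edgeThreshold=63)
    kpA, desA = orb.detectAndCompute(frameA, None)
    kpB, desB = orb.detectAndCompute(frameB, None)
    
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(desA, desB), key=lambda m: m.distance)[:100]
    
    ptsA = np.float32([kpA[m.queryIdx].pt for m in matches])
    ptsB = np.float32([kpB[m.trainIdx].pt for m in matches])
    
    # 4-DOF transform instead of an 8-DOF homography; RANSAC discards bad matches
    M, inliers = cv2.estimateAffinePartial2D(ptsA, ptsB, method=cv2.RANSAC,
                                             ransacReprojThreshold=3.0)
    if M is not None:  # M is None when RANSAC cannot find a consensus
        dx, dy = M[0, 2], M[1, 2]
        angle = np.degrees(np.arctan2(M[1, 0], M[0, 0]))
        print(dx, dy, angle)

This still relies on the same small features, though, so I am not sure it fixes the "edge of cactus vs. middle of cactus" mismatches.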

My attempt at a solution is to take the entire picture, do an "overlapping" sweep, and compute the difference at each offset. Where the difference is at a minimum, I have the proper X,Y shift. This, however, has two problems:

  1. It is slow. Way too slow.
  2. It cannot handle rotation without becoming even slower, because the search space grows.

    import time
    
    import cv2
    import numpy as np
    
    # load image 1
    image1 = cv2.imread('img1.png')
    print(image1.shape)
    
    img1 = cv2.cvtColor(image1, cv2.COLOR_BGR2GRAY)
    nw1, nh1 = img1.shape  # note: shape is (rows, cols), so nw* holds the row count here
    nw15, nh15 = int(nw1/2), int(nh1/2)
    
    # load image 2
    image2 = cv2.imread('img2.png')
    img2 = cv2.cvtColor(image2, cv2.COLOR_BGR2GRAY)
    nw2, nh2 = img2.shape
    nw25, nh25 = int(nw2/2), int(nh2/2)
    
    # generate base canvas, note that img1 could be top left of img2, or img2 could be top left of img1
    # the search space of this is very large
    nw, nh = nw1 + nw2*2, nh1 + nh2*2
    cnw, cnh = int(nw/2), int(nh/2)  # get the center point for later calculations
    
    base_image1 = np.ones((nw, nh), np.uint8) * 255  # make the background white
    # set the first image in the center (explicit lengths so odd-sized images still fit)
    base_image1[cnw-nw15: cnw-nw15+nw1, cnh-nh15: cnh-nh15+nh1] = img1
    
    # create the image we want to "sweep over"; we "pre-allocate" since creating new arrays is expensive
    sweep_image = np.zeros((nw, nh), np.uint8)  # keep at 0 for BLACK
    
    stime = time.time()
    total_blend = []
    
    # sweep over my search space!
    for x_s in np.arange(20, 80):        # limit search space so it finishes this year
        for y_s in np.arange(300, 500):  # limit search space so it finishes this year
            w1, w2 = cnw-nw25+x_s, cnw-nw25+nw2+x_s  # row slice to place our sweep image
            h1, h2 = cnh-nh25+y_s, cnh-nh25+nh2+y_s  # column slice to place our sweep image
            
            sweep_image[w1: w2, h1: h2] = img2            # set the image
            diff = cv2.absdiff(base_image1, sweep_image)  # calculate the pixel-wise difference
            
            total_blend.append([x_s, y_s, np.sum(diff)])  # store the offset and its score
            
            sweep_image[w1: w2, h1: h2] = 0  # reset back to zero
            # debug view -- uncomment to inspect each offset (waitKey(0) blocks on every iteration)
            # cv2.imshow('diff', diff)
            # cv2.waitKey(0)
            
    print(time.time() - stime)
    
    # convert to array and pull out the offset with the smallest difference
    total_blend = np.array(total_blend)
    mymin = np.min(total_blend[:, 2])
    print(total_blend[total_blend[:, 2] == mymin])  # get the best coordinates for translation
Examples below. Example 1: note the giant white borders, which come from making the images the same size across the ENTIRE search space. This is an OK-ish match, but notice how the dark regions are not very dark.

Example 2: (same large white borders), but notice how the dark regions are actually black. This is close to the minimum.

(Example 1 screenshot)

(Example 2 screenshot)

All help and thoughts are appreciated. Is there a way to dictate the "size" of the feature detectors? Is there a faster way to sweep? Maybe something with RMSE and numpy eigenvalues? This is linear algebra, after all...
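
On the "this is linear algebra after all" point, the other direction I have been reading about is phase correlation, which (if I understand it correctly) recovers the translation between two equally sized frames from a single FFT-based computation instead of a spatial sweep. A sketch of what I am looking at, assuming pure translation:

    import cv2
    import numpy as np
    
    img1 = cv2.imread('img1.png', cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread('img2.png', cv2.IMREAD_GRAYSCALE)
    
    # phaseCorrelate wants float32/float64 input; a Hanning window damps edge effects
    f1, f2 = np.float32(img1), np.float32(img2)
    window = cv2.createHanningWindow(img1.shape[::-1], cv2.CV_32F)
    
    (dx, dy), response = cv2.phaseCorrelate(f1, f2, window)
    print(dx, dy, response)  # sub-pixel (x, y) shift plus a confidence-like response

As far as I can tell this only handles translation; recovering rotation would need something like a log-polar variant, which I have not tried.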

I am using python3, opencv2.