
I am working on a pedestrian tracking algorithm using Python3 & OpenCV.

We can use SIFT keypoints as an identifier of a pedestrian's silhouette in a frame and then perform brute-force matching between two sets of SIFT keypoints (i.e. between one frame and the next one) to find the pedestrian in the next frame. To visualize this on the sequence of frames, we can draw a bounding rectangle delimiting the pedestrian. This is what it looks like:

tracking example
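
For context, here is a rough sketch of this detect-and-match step (prev_roi / curr_roi stand in for the pedestrian region cropped from two consecutive frames; the actual implementation uses SURF and a hand-written matcher, see the edit below):

    import cv2

    # detect SIFT keypoints and descriptors in the pedestrian region of each frame
    sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() on older OpenCV builds
    kp_prev, des_prev = sift.detectAndCompute(prev_roi, None)
    kp_curr, des_curr = sift.detectAndCompute(curr_roi, None)

    # brute-force matching on the descriptors, keeping only mutual best matches
    bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(bf.match(des_prev, des_curr), key=lambda m: m.distance)

    # matched keypoint coordinates in the previous and current frame
    prev_pts = [kp_prev[m.queryIdx].pt for m in matches]
    curr_pts = [kp_curr[m.trainIdx].pt for m in matches]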

The main problem is about characterizing the motion of the pedestrian using the keypoints. The idea here is to find an affine transform (that is, translation in x & y, rotation & scaling) using the coordinates of the keypoints on 2 successive frames. Ideally, this affine transform somehow corresponds to the motion of the pedestrian. To track this pedestrian, we would then just have to apply the same affine transform to the bounding rectangle coordinates. That last part doesn't work well. The rectangle consistently shrinks over several frames until it inevitably disappears, or drifts away from the pedestrian, as you can see below or on the previous image:

tracking results

To be precise, we characterize the bounding rectangle by its 2 extreme points:

bounding rectangle coordinates

There are some built-in cv2 functions that can apply an affine transform to an image, like cv2.warpAffine(), but I want to apply it only to the bounding rectangle coordinates (i.e. 2 points, or 1 point + width & height).

To find the affine transform between the 2 sets of keypoints, I’ve written my own function (I can post the code if it helps), but I’ve observed similar results when using cv2.getAffineTransform() for instance.

Do you know how to properly apply an affine transform to this bounding rectangle?


EDIT: here's some explanation & code for better context:

  • The pedestrian detection is done with the pre-trained SVM classifier available in OpenCV: hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector()) & hog.detectMultiScale() (a minimal sketch of this step is shown after this list)

  • Once a first pedestrian is detected, the SVM returns the coordinates of the associated bounding rectangle (xA, yA, w, h) (we stop using the SVM after the 1st detection as it is quite slow, and we are focusing on one pedestrian for now)

  • We select the corresponding region of the current frame with image[yA: yA+h, xA: xA+w] and search for SURF keypoints within it with surf.detectAndCompute()

  • This returns the keypoints & their associated descriptors (a 64-value descriptor vector for each keypoint)

  • We perform brute force matching, based on the L2-norm between the descriptors and the distance in pixels between the keypoints to construct pairs of keypoints between the current frame & the previous one. The code for this function is pretty long, but should be similar to cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)

  • Once we have the matched pairs of keypoints, we can use them to find the affine transform with this function (an OpenCV-based alternative for estimating this transform is sketched after this list):

    # model: (x', y') = alpha * R(theta) * (x, y) + (tx, ty), i.e. a similarity
    # transform with 4 unknowns [alpha*cos(theta), alpha*sin(theta), tx, ty],
    # estimated by linear least squares
    previousKpts = previousKpts[:5]  # keep the 5 best matches
    currentKpts = currentKpts[:5]
    
    # build A matrix of shape [2 * Nb of keypoints, 4]
    A = np.empty((2 * len(previousKpts), 4))
    
    for idx, keypoint in enumerate(previousKpts):
        # keypoint.pt = (x-coord, y-coord)
        A[2 * idx, :] = [keypoint.pt[0], -keypoint.pt[1], 1, 0]
        A[2 * idx + 1, :] = [keypoint.pt[1], keypoint.pt[0], 0, 1]
    
    # build b matrix of shape [2 * Nb of keypoints, 1]
    b = np.empty((2 * len(previousKpts), 1))
    
    for idx, keypoint in enumerate(currentKpts):
        b[2 * idx, :] = keypoint.pt[0]
        b[2 * idx + 1, :] = keypoint.pt[1]
    
    # normal-equation (least squares) solution x = ((A' * A)^-1) * A' * b
    x = np.linalg.inv(A.T @ A) @ A.T @ b
    
    theta = math.atan2(x[1, 0], x[0, 0])  # rotation angle in [-pi, pi]
    alpha = math.sqrt(x[0, 0] ** 2 + x[1, 0] ** 2)  # scaling factor
    tx = x[2, 0]  # translation along x-axis
    ty = x[3, 0]  # translation along y-axis
    
    return theta, alpha, tx, ty
    
  • We then just have to apply the same affine transform to the corner points of the bounding rectangle:

    # define the 4 bounding points using xA, yA
    xB = xA + w
    yB = yA + h
    rect_pts = np.array([[[xA, yA]], [[xB, yA]], [[xA, yB]], [[xB, yB]]], dtype=np.float32)
    
    # warp the affine transform into a full perspective transform
    affine_warp = np.array([[alpha*np.cos(theta), -alpha*np.sin(theta), tx],
                            [alpha*np.sin(theta), alpha*np.cos(theta), ty],
                            [0, 0, 1]], dtype=np.float32)
    
    # apply affine transform
    rect_pts = cv2.perspectiveTransform(rect_pts, affine_warp)
    xA = rect_pts[0, 0, 0]
    yA = rect_pts[0, 0, 1]
    xB = rect_pts[3, 0, 0]
    yB = rect_pts[3, 0, 1]
    
    return xA, yA, xB, yB
    
  • Save the updated rectangle coordinates (xA, yA, xB, yB) and all current keypoints & descriptors, then move on to the next frame: select image[yA: yB, xA: xB] using the (xA, yA, xB, yB) we previously saved, get SURF keypoints, etc.
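
As mentioned in the first bullet point, the initial detection relies on OpenCV's built-in HOG people detector. A minimal sketch of that step (frame is a placeholder for the current image, and the winStride / padding / scale values are illustrative, not necessarily the ones I use):

    import cv2

    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    # detect pedestrians in the full frame; returns rectangles and confidence weights
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                          padding=(16, 16), scale=1.05)

    # keep the first detection as the initial bounding rectangle
    xA, yA, w, h = rects[0]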
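
Also, for reference against the hand-rolled least-squares function above: OpenCV 3.2+ can estimate the same 4-parameter similarity transform (rotation, uniform scale, translation) directly from the matched coordinates, with RANSAC to discard bad matches. This is only a sketch of that alternative, not what my code currently does (prev_pts / curr_pts are the matched (x, y) coordinates from the previous and current frames):

    import numpy as np
    import cv2

    # matched keypoint coordinates, shaped as (N, 1, 2) float arrays
    src = np.float32(prev_pts).reshape(-1, 1, 2)
    dst = np.float32(curr_pts).reshape(-1, 1, 2)

    # 2x3 similarity transform [[a, -b, tx], [b, a, ty]], with RANSAC outlier rejection
    M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)

    # append [0, 0, 1] so it can be used with cv2.perspectiveTransform() on the box corners
    affine_warp = np.vstack([M, [0, 0, 1]]).astype(np.float32)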

You should take this question to Cross Validated, because you aren't asking about the programming language. Instead, you're asking a math question with a reference to computer science. – user9048861

cv2.perspectiveTransform function on all 4 bounding box points – Micka

Small comment: some_list[:5] returns the first five elements, with indices 0, 1, 2, 3, 4... not the first four. – alkasm

1 Answer


As Micka suggested, cv2.perspectiveTransform() is an easy way to accomplish this. You'll just need to turn your affine warp into a full perspective transform (homography) by adding a third row at the bottom with the values [0, 0, 1]. For example, let's put a box with w, h = 100, 200 at the point (10, 20) and then use an affine transformation to shift the points so that the box is moved to (0, 0) (i.e. shift 10 pixels to the left and 20 pixels up):

>>> import cv2
>>> import numpy as np
>>> xA, yA, w, h = (10, 20, 100, 200)
>>> xB, yB = xA + w, yA + h
>>> rect_pts = np.array([[[xA, yA]], [[xB, yA]], [[xA, yB]], [[xB, yB]]], dtype=np.float32)
>>> affine_warp = np.array([[1, 0, -10], [0, 1, -20], [0, 0, 1]], dtype=np.float32)
>>> cv2.perspectiveTransform(rect_pts, affine_warp)
array([[[   0.,    0.]],

       [[ 100.,    0.]],

       [[   0.,  200.]],

       [[ 100.,  200.]]], dtype=float32)

So that works perfectly as expected. You could also just simply transform the points yourself with matrix multiplication:

>>> rect_pts.dot(affine_warp[:2, :2].T) + affine_warp[:2, 2]
array([[[   0.,    0.]],

       [[ 100.,    0.]],

       [[   0.,  200.]],

       [[ 100.,  200.]]], dtype=float32)
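
As a side note: if the transform includes a rotation, the four transformed corners are no longer axis-aligned, so taking specific corners (like indices 0 and 3 in the question's code) as the new top-left and bottom-right can shrink or distort the box over time. One simple option, shown here as a sketch rather than as part of the original approach, is to take the min/max over all four transformed corners:

>>> pts = cv2.perspectiveTransform(rect_pts, affine_warp).reshape(-1, 2)
>>> (xA, yA), (xB, yB) = pts.min(axis=0), pts.max(axis=0)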