I am working on a pedestrian tracking algorithm using Python3 & OpenCV.
We can use SIFT keypoints to identify a pedestrian’s silhouette in a frame, and then perform brute-force matching between two sets of SIFT keypoints (i.e. between one frame and the next) to find the pedestrian in the next frame. To visualize this on the sequence of frames, we draw a bounding rectangle delimiting the pedestrian. This is what it looks like:
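In code, that detect / match / draw step is roughly the following sketch (frame paths and rectangle coordinates are placeholders; depending on the OpenCV build, SIFT is created with cv2.SIFT_create() or cv2.xfeatures2d.SIFT_create()):

    import cv2

    prev_gray = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # previous frame (placeholder path)
    curr_gray = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)  # current frame (placeholder path)

    sift = cv2.SIFT_create()  # or cv2.xfeatures2d.SIFT_create() on older contrib builds
    kp_prev, des_prev = sift.detectAndCompute(prev_gray, None)
    kp_curr, des_curr = sift.detectAndCompute(curr_gray, None)

    # brute-force matching on the descriptors, keeping only mutual best matches
    bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(bf.match(des_prev, des_curr), key=lambda m: m.distance)

    # draw the pedestrian's bounding rectangle on the current frame (placeholder coordinates)
    xA, yA, w, h = 100, 80, 64, 128
    vis = cv2.cvtColor(curr_gray, cv2.COLOR_GRAY2BGR)
    cv2.rectangle(vis, (xA, yA), (xA + w, yA + h), (0, 255, 0), 2)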
The main problem is characterizing the motion of the pedestrian using the keypoints. The idea is to find an affine transform (i.e. translation in x & y, rotation & scaling) from the coordinates of the keypoints in 2 successive frames. Ideally, this affine transform somehow corresponds to the motion of the pedestrian. To track the pedestrian, we would then just have to apply the same affine transform to the bounding rectangle coordinates. That last part doesn’t work well: the rectangle consistently shrinks over several frames until it disappears, or drifts away from the pedestrian, as you can see below or in the previous image:
To be specific, we characterize the bounding rectangle by 2 extreme points:
There are built-in cv2 functions that can apply an affine transform to an image, like cv2.warpAffine(), but I want to apply it only to the bounding rectangle coordinates (i.e. 2 points, or 1 point plus width & height).
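For what it’s worth, a 2x3 affine matrix can be applied directly to a small set of points with cv2.transform(), without warping the whole image; a sketch with a placeholder matrix and rectangle:

    import numpy as np
    import cv2

    xA, yA, w, h = 100, 80, 64, 128                    # placeholder rectangle
    M = np.array([[1.0, 0.0, 5.0],                     # placeholder 2x3 affine matrix
                  [0.0, 1.0, 3.0]], dtype=np.float32)  # (here: pure translation by (5, 3))

    # the 2 extreme points, shaped (N, 1, 2) as cv2 point functions expect
    corners = np.array([[[xA, yA]], [[xA + w, yA + h]]], dtype=np.float32)

    warped = cv2.transform(corners, M)
    (new_xA, new_yA), (new_xB, new_yB) = warped[:, 0, :]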
To find the affine transform between the 2 sets of keypoints, I’ve written my own function (I can post the code if it helps), but I’ve observed similar results when using cv2.getAffineTransform() for instance.
Do you know how to properly apply an affine transform to this bounding rectangle?
EDIT: here’s some explanation & code for better context:
The pedestrian detection is done with the pre-trained SVM classifier available in OpenCV:
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
and hog.detectMultiScale()
Once a first pedestrian is detected, the SVM returns the coordinates of the associated bounding rectangle
(xA, yA, w, h)
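Concretely, that detection step looks roughly like this (the detectMultiScale parameters and frame path are placeholders):

    import cv2

    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    frame = cv2.imread("frame_000.png")  # first frame (placeholder path)

    # each entry of rects is (x, y, w, h); weights are the SVM confidence scores
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)
    if len(rects) > 0:
        xA, yA, w, h = rects[0]  # keep the first detected pedestrian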
(we stop using the SVM after the 1st detection as it is quite slow, and we are focusing on one pedestrian for now). We select the corresponding region of the current frame with
image[yA: yA+h, xA: xA+w]
and search for SURF keypoints in it with surf.detectAndCompute()
This returns the keypoints and their associated descriptors (a 64-element descriptor vector for each keypoint).
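A sketch of that step (SURF lives in the xfeatures2d contrib module; the Hessian threshold, frame path and rectangle values are placeholders):

    import cv2

    image = cv2.imread("frame_001.png")   # current frame (placeholder path)
    xA, yA, w, h = 100, 80, 64, 128       # rectangle from the detection step (placeholder values)

    roi = image[yA: yA + h, xA: xA + w]

    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

    # keypoints: list of cv2.KeyPoint; descriptors: array of shape (N, 64)
    keypoints, descriptors = surf.detectAndCompute(roi, None)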
We perform brute force matching, based on the L2-norm between the descriptors and the distance in pixels between the keypoints to construct pairs of keypoints between the current frame & the previous one. The code for this function is pretty long, but should be similar to
cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
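What feeds the estimation function below is essentially the matched keypoints taken pairwise in the same order; a sketch, assuming prev_kpts/prev_desc and curr_kpts/curr_desc come from the SURF step on the previous and current frames:

    import cv2

    bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(bf.match(prev_desc, curr_desc), key=lambda m: m.distance)

    # queryIdx indexes the first descriptor set, trainIdx the second,
    # so previousKpts[i] and currentKpts[i] form a matched pair
    previousKpts = [prev_kpts[m.queryIdx] for m in matches]
    currentKpts = [curr_kpts[m.trainIdx] for m in matches]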
Once we have the matched pairs of keypoints, we can use them to find the affine transform with this function:
    previousKpts = previousKpts[:5]  # select 4 best matches
    currentKpts = currentKpts[:5]

    # build A matrix of shape [2 * Nb of keypoints, 4]
    A = np.ndarray((2 * len(previousKpts), 4))
    for idx, keypoint in enumerate(previousKpts):
        # Keypoint.pt = (x-coord, y-coord)
        A[2 * idx, :] = [keypoint.pt[0], -keypoint.pt[1], 1, 0]
        A[2 * idx + 1, :] = [keypoint.pt[1], keypoint.pt[0], 0, 1]

    # build b matrix of shape [2 * Nb of keypoints, 1]
    b = np.ndarray((2 * len(previousKpts), 1))
    for idx, keypoint in enumerate(currentKpts):
        b[2 * idx, :] = keypoint.pt[0]
        b[2 * idx + 1, :] = keypoint.pt[1]

    # convert the numpy.ndarrays to matrix
    A = np.matrix(A)
    b = np.matrix(b)

    # solution of the form x = [x1, x2, x3, x4]' = ((A' * A)^-1) * A' * b
    x = np.linalg.inv(A.T * A) * A.T * b

    theta = math.atan2(x[1, 0], x[0, 0])  # outputs rotation angle in [-pi, pi]
    alpha = math.sqrt(x[0, 0] ** 2 + x[1, 0] ** 2)  # scaling parameter
    bx = x[2, 0]  # translation along x-axis
    by = x[3, 0]  # translation along y-axis

    return theta, alpha, bx, by
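As a cross-check on the least-squares fit above, newer OpenCV builds (3.2+) can estimate the same 4-parameter similarity transform directly; a sketch, assuming the matched keypoints from the step above:

    import math
    import numpy as np
    import cv2

    prev_pts = np.float32([kp.pt for kp in previousKpts])
    curr_pts = np.float32([kp.pt for kp in currentKpts])

    # 2x3 matrix for rotation + uniform scale + translation, fitted with RANSAC;
    # on older builds cv2.estimateRigidTransform(prev_pts, curr_pts, False) plays a similar role
    M, inliers = cv2.estimateAffinePartial2D(prev_pts, curr_pts)

    theta = math.atan2(M[1, 0], M[0, 0])   # rotation angle
    alpha = math.hypot(M[0, 0], M[1, 0])   # scale
    bx, by = M[0, 2], M[1, 2]              # translation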
We then just have to apply the same affine transform to the corner points of the bounding rectangle:
    # define the 4 bounding points using xA, yA
    xB = xA + w
    yB = yA + h
    rect_pts = np.array([[[xA, yA]], [[xB, yA]], [[xA, yB]], [[xB, yB]]], dtype=np.float32)

    # warp the affine transform into a full perspective transform
    affine_warp = np.array([[alpha * np.cos(theta), -alpha * np.sin(theta), bx],
                            [alpha * np.sin(theta),  alpha * np.cos(theta), by],
                            [0, 0, 1]], dtype=np.float32)

    # apply affine transform
    rect_pts = cv2.perspectiveTransform(rect_pts, affine_warp)

    xA = rect_pts[0, 0, 0]
    yA = rect_pts[0, 0, 1]
    xB = rect_pts[3, 0, 0]
    yB = rect_pts[3, 0, 1]

    return xA, yA, xB, yB
Save the updated rectangle coordinates
(xA, yA, xB, yB)
, all current keypoints & descriptors, and iterate over the next frame: select
image[yA: yB, xA: xB]
using (xA, yA, xB, yB)
we previously saved, get SURF keypoints etc.
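Put together, the per-frame loop looks roughly like this (match_keypoints, estimate_similarity and warp_rectangle are hypothetical names standing in for the functions above; frames is assumed to be an iterable of images):

    # state carried over from the first detection
    prev_kpts, prev_desc = keypoints, descriptors
    rect = (xA, yA, xA + w, yA + h)

    for frame in frames:
        xA, yA, xB, yB = [int(round(v)) for v in rect]
        roi = frame[yA: yB, xA: xB]

        kpts, desc = surf.detectAndCompute(roi, None)

        previousKpts, currentKpts = match_keypoints(prev_kpts, prev_desc, kpts, desc)  # hypothetical helper
        theta, alpha, bx, by = estimate_similarity(previousKpts, currentKpts)          # the function above
        rect = warp_rectangle(xA, yA, xB - xA, yB - yA, theta, alpha, bx, by)          # the function above

        prev_kpts, prev_desc = kpts, desc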
some_list[:5]
returns the first five elements, with indices 0, 1, 2, 3, 4...not the first four. – alkasm