I am working on a project where I have to detect a known picture in a scene in "real time" in a mobile context (meaning I capture frames with a smartphone camera and resize each frame to 150x225). The picture itself can be rather complex. Right now, I process each frame in 1.2 s on average (using OpenCV). I'm looking for ways to improve this processing time and the overall accuracy. My current implementation works as follows (a simplified sketch follows the list):
- Capture the frame
- Convert it to grayscale
- Detect the keypoints and extract the descriptors using ORB
- Match the descriptors (2-NN, object -> scene) and filter them with the ratio test
- Match the descriptors (2-NN, scene -> object) and filter them with the ratio test
- Remove non-symmetrical matches, i.e. keep only the matches the two previous steps agree on
- Compute the matching confidence (percentage of matched keypoints out of the total keypoints)
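For reference, here is a minimal C++ sketch of that pipeline, assuming the standard OpenCV `ORB`/`BFMatcher` API (the names `matchFrame` and `ratioTest` are just placeholders, and in the real app the object's keypoints/descriptors are of course computed once, not per frame):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Keep the best match of each 2-NN pair if it is clearly better than the
// second-best one (Lowe's ratio test).
static void ratioTest(const std::vector<std::vector<cv::DMatch>>& knn,
                      std::vector<cv::DMatch>& good, float ratio = 0.8f) {
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < ratio * m[1].distance)
            good.push_back(m[0]);
}

double matchFrame(const cv::Mat& frameBGR, const cv::Mat& objectGray) {
    // Grayscale conversion of the captured frame
    cv::Mat sceneGray;
    cv::cvtColor(frameBGR, sceneGray, cv::COLOR_BGR2GRAY);

    // ORB keypoints + binary descriptors (compared in Hamming space)
    cv::Ptr<cv::ORB> orb = cv::ORB::create(500);
    std::vector<cv::KeyPoint> kpObj, kpScene;
    cv::Mat descObj, descScene;
    orb->detectAndCompute(objectGray, cv::noArray(), kpObj, descObj);
    orb->detectAndCompute(sceneGray, cv::noArray(), kpScene, descScene);
    if (descObj.empty() || descScene.empty()) return 0.0;

    // 2-NN matching in both directions, each filtered by the ratio test
    cv::BFMatcher matcher(cv::NORM_HAMMING);
    std::vector<std::vector<cv::DMatch>> knnOS, knnSO;
    matcher.knnMatch(descObj, descScene, knnOS, 2);   // object -> scene
    matcher.knnMatch(descScene, descObj, knnSO, 2);   // scene -> object
    std::vector<cv::DMatch> goodOS, goodSO;
    ratioTest(knnOS, goodOS);
    ratioTest(knnSO, goodSO);

    // Symmetry filter: keep matches that agree in both directions
    std::vector<cv::DMatch> symmetric;
    for (const auto& os : goodOS)
        for (const auto& so : goodSO)
            if (os.queryIdx == so.trainIdx && os.trainIdx == so.queryIdx) {
                symmetric.push_back(os);
                break;
            }

    // Confidence = matched keypoints / total object keypoints
    return kpObj.empty() ? 0.0
                         : static_cast<double>(symmetric.size()) / kpObj.size();
}
```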
My approach might not be the right one, but the results are OK, even though there's a lot of room for improvement. I already noticed that SURF extraction is too slow, and I couldn't get homography to work (it might be related to ORB). All suggestions are welcome!
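Regarding homography: as far as I understand, `cv::findHomography` works the same with ORB as with any other detector, since it only takes point coordinates; the usual pitfalls are passing fewer than 4 correspondences or mixing up query/train indices. A sketch of what I believe the step should look like, reusing `symmetric`, `kpObj` and `kpScene` from the code above:

```cpp
#include <opencv2/calib3d.hpp>

cv::Mat estimateHomography(const std::vector<cv::DMatch>& symmetric,
                           const std::vector<cv::KeyPoint>& kpObj,
                           const std::vector<cv::KeyPoint>& kpScene) {
    if (symmetric.size() < 4) return cv::Mat();  // not enough correspondences

    // Convert matched keypoints to raw 2D points
    std::vector<cv::Point2f> objPts, scenePts;
    for (const auto& m : symmetric) {
        objPts.push_back(kpObj[m.queryIdx].pt);     // queryIdx indexes the object
        scenePts.push_back(kpScene[m.trainIdx].pt); // trainIdx indexes the scene
    }

    // RANSAC rejects outliers that survived the ratio/symmetry filters
    cv::Mat inlierMask;
    cv::Mat H = cv::findHomography(objPts, scenePts, cv::RANSAC, 3.0, inlierMask);
    return H;  // empty if RANSAC failed to find a consistent model
}
```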