I am working on a project that tracks humans in aerial videos. One of the algorithms we will use is SURF. I understand that SURF uses interest points, but I'm quite confused about what comes after that. How exactly can I use the interest points for classification? I want to identify which detected objects in the video are humans and which are other objects, so of course I need a training set, but what should it consist of? I've read somewhere that bag-of-words (BoW) should be used, but are there any other ways of extracting these SURF features? If I read the original SURF paper by Herbert Bay correctly, it does not mention how the features are extracted, what the output is, or how they are prepared for classification.
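To make the question concrete, here is my current (possibly wrong) understanding of the BoW idea as a minimal Python sketch. It assumes each detected object yields a variable number of 64-dimensional SURF descriptors; random vectors stand in for real descriptors here, and every function name (`build_vocabulary`, `bow_histogram`, `fake_descriptors`) is one I made up for illustration, not part of any library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to one descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def build_vocabulary(descriptors, k, iterations=10, seed=0):
    """Plain k-means over all training descriptors; the k centroids are the visual words."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centroids)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def bow_histogram(descriptors, vocabulary):
    """Fixed-length, L1-normalised histogram of visual-word counts for one detection."""
    counts = [0.0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Stand-in data: random 64-D vectors instead of real SURF descriptors (assumption).
rng = random.Random(1)


def fake_descriptors(n, dim=64):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


# Six hypothetical detections, each with a different number of interest points.
patches = [fake_descriptors(rng.randint(5, 20)) for _ in range(6)]
all_descriptors = [d for patch in patches for d in patch]

vocabulary = build_vocabulary(all_descriptors, k=8)
features = [bow_histogram(patch, vocabulary) for patch in patches]
```

As far as I can tell, each entry of `features` has the same length no matter how many interest points the detection had, so these histograms, together with human / not-human labels, would be the training set for a classifier such as an SVM. Is that the right idea?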
I'm really confused. Please help. Thank you!