I was thinking of training the convnet on a big set of cropped hand
images (+ random images without hands) and then apply the classifier
on all the subsquares of my images. Is this a good approach?
Yes, I believe this would be a good approach. However, note that when you say "random", you should probably sample those images from the kind of images where hands are most likely to appear. It really depends on your use case, and you have to tune the data set to fit what you're doing.
I would build the data set something like this:
- Crop images of hands from a big image.
- Sample X crops from that same image, but not anywhere near the hand(s).
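The second step could be sketched roughly as follows. This is only a minimal illustration, assuming you already have the hand bounding boxes as `(x, y, w, h)` tuples; the function names, the `margin` padding and the `max_tries` cap are my own choices, not part of any library.

```python
import random

def boxes_overlap(a, b, margin=0):
    """True if box a overlaps box b after padding b by `margin` pixels.
    Boxes are (x, y, w, h) tuples (an assumed convention)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax < bx + bw + margin and bx - margin < ax + aw and
            ay < by + bh + margin and by - margin < ay + ah)

def sample_negative_boxes(img_w, img_h, hand_boxes, crop, n,
                          margin=16, max_tries=10_000, seed=None):
    """Sample up to n square crop boxes that stay away from every hand box."""
    rng = random.Random(seed)
    negatives = []
    tries = 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        # Pick a random top-left corner so the crop fits inside the image.
        x = rng.randint(0, img_w - crop)
        y = rng.randint(0, img_h - crop)
        candidate = (x, y, crop, crop)
        # Keep the crop only if it is not near any hand.
        if not any(boxes_overlap(candidate, hb, margin) for hb in hand_boxes):
            negatives.append(candidate)
    return negatives

# Hypothetical example: one 64x64 hand box in a 640x480 image.
negatives = sample_negative_boxes(640, 480, [(300, 200, 64, 64)],
                                  crop=64, n=20, seed=0)
```

Because the negatives come from the same images as the hands, they share the same backgrounds and lighting, which is the point of sampling them this way.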
If, however, you choose to do something like this, you might get bad results:
- Crop images of hands from a big image.
- Download 1 million images (an exaggeration) that definitely don't have hands, for example deserts, oceans, skies, caves, mountains, basically lots of scenery, and use these as your "random images without hands".
The reason is that your images already follow an underlying distribution. I assume most of them are pictures of, say, groups of friends having a party at a house, or have buildings in the background. If that assumption holds, introducing scenery images could corrupt this distribution, because the negatives would no longer look like the backgrounds the classifier actually sees in your images.
Therefore, be really careful when using "random images"!
on all the subsquares of my images
As for this part of your question: you are essentially running a sliding window over the entire image. Yes, in practice it would work. But if speed matters, this may not be a good idea, since the classifier has to run once per window. You might want to run a segmentation algorithm first to narrow down the search space.
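To give a feel for the cost, here is a rough sketch of the sliding window itself. It assumes a single fixed window size and a fixed stride (both hypothetical parameters you would have to pick), and it only generates the boxes; running the classifier on each one is left out.

```python
def sliding_windows(img_w, img_h, window, stride):
    """Yield (x, y, window, window) boxes over the image, row by row."""
    for y in range(0, img_h - window + 1, stride):
        for x in range(0, img_w - window + 1, stride):
            yield (x, y, window, window)

def count_windows(img_w, img_h, window, stride):
    """Number of classifier calls the sliding window would need."""
    return sum(1 for _ in sliding_windows(img_w, img_h, window, stride))

# Hypothetical 640x480 image with a 64x64 window.
coarse = count_windows(640, 480, window=64, stride=8)
dense = count_windows(640, 480, window=64, stride=1)
```

For this example the stride of 8 gives 3,869 windows and the stride of 1 gives 240,609, and that is for one window size only. Anything that discards regions up front reduces the number of classifier calls directly.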
Are there other examples of complex 2-class convnets / RNNs I could
use for inspiration?
I'm not sure what you mean by complex 2-class convnets. I'm not familiar with RNNs, so let me focus on convnets. You can basically define the convolutional net yourself: the size of the convolutional layers, how many layers there are, which max pooling method you use, how big the fully connected layer is, and so on. The last layer is basically a softmax layer, where the net decides which class the input belongs to. If you have 2 classes, your last layer has 2 nodes. If you have 3, then 3, and so on. So it can range from 2 to perhaps even 1000. I've not heard of convnets that have more than 1000 classes, but I could be ill-informed. I hope this helps!
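As a small illustration of that last layer, here is a plain-Python sketch of a softmax over the output nodes. The raw scores and the class names `no_hand` / `hand` are made up for the example; in a real net the scores would come from the layer before the softmax.

```python
import math

def softmax(scores):
    """Turn raw output-node scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores, class_names):
    """Return the most probable class name and the full probability list."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Two classes means two output nodes, so two scores.
label, probs = predict([0.3, 2.1], ["no_hand", "hand"])
```

Going from 2 classes to 3 or 1000 only changes the length of the score list; the softmax itself stays the same.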