I've been trying to use Tesseract to identify digits in a series of images, and after scouring the web for advice I've made a number of improvements. So far I've tried the following:
- Binarize the image at an appropriate threshold to pick out the numbers
- Restrict Tesseract to digits only
- Upscale the image using a variety of approaches (getScaledInstance with Image.SCALE_SMOOTH, AffineTransform using AffineTransformOp.TYPE_BICUBIC)
- Explore different Tesseract page segmentation modes. Currently using mode 6.
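For reference, the binarize and upscale steps above can be sketched roughly as follows. This is a minimal illustration and not my exact code: the class name `DigitPreprocessor`, the method names, and the luma weights are assumptions for the example. It uses a `Graphics2D` bicubic rendering hint rather than `AffineTransformOp`, so the output image type stays under explicit control. The Tesseract side (digit restriction, page segmentation mode) is left out because the exact calls depend on the tess4j version.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper class; names are illustrative only.
final class DigitPreprocessor {

    /** Pixels darker than the threshold (0-255) become black, all others white. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (assumed weights).
                int luma = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    /** Scales the image by an integer factor using bicubic interpolation. */
    static BufferedImage upscale(BufferedImage src, int factor) {
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

A call such as `DigitPreprocessor.upscale(DigitPreprocessor.binarize(img, 128), 3)` chains the two steps; the threshold 128 and factor 3 are placeholder values. Note that bicubic scaling of a binarized image reintroduces gray pixels along the edges, so the order of the two steps (or a second binarize pass) affects the result.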
The numbers are all identical in shape and perfectly aligned, though their edges are somewhat jagged. Example processed images:
Tesseract does okay with these, but it often mistakes 8 for 3, 6 for 5, and 9 for 5.
I've been looking into different ways to smooth the image and trying different scales, but I'm also wondering whether it makes more sense to just train Tesseract. With only 10 possible values that are always nearly identical, it seems like it shouldn't be too hard for it to learn to recognize them, but training Tesseract also seems like a huge pain.
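To make "smoothing" concrete, one option is to blur the binarized image slightly and then threshold it again, which rounds off jagged edges and removes isolated specks. The sketch below is only an illustration under assumptions: the class name `EdgeSmoother`, the 3x3 kernel, and the choice to read a single channel (which presumes a grayscale or already-binarized input) are all mine, and I don't know yet whether this actually helps Tesseract.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper; a 3x3 Gaussian-like blur followed by re-thresholding.
final class EdgeSmoother {

    // Kernel weights sum to 16.
    private static final int[][] KERNEL = {
        {1, 2, 1},
        {2, 4, 2},
        {1, 2, 1}
    };

    /** Blurs a grayscale image and maps each blurred pixel back to black or white. */
    static BufferedImage smooth(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int ky = -1; ky <= 1; ky++) {
                    for (int kx = -1; kx <= 1; kx++) {
                        // Clamp coordinates so border pixels reuse their nearest neighbor.
                        int sx = Math.min(w - 1, Math.max(0, x + kx));
                        int sy = Math.min(h - 1, Math.max(0, y + ky));
                        // Assumes gray input, so the blue channel equals the gray level.
                        int gray = src.getRGB(sx, sy) & 0xFF;
                        sum += gray * KERNEL[ky + 1][kx + 1];
                    }
                }
                int blurred = sum / 16;
                out.setRGB(x, y, blurred < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

With a mid-range threshold, a single black pixel surrounded by white is erased, while solid regions are unchanged. Larger kernels or repeated passes would smooth more aggressively, at the risk of closing the gaps that distinguish 8 from 3 or 6 from 5.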
Any suggestions on how to get the final bit of accuracy out of Tesseract on these images?
I'm using tess4j and Java, so Java-specific suggestions and libraries are especially appreciated. While I'm willing to implement algorithms myself, I'd hate to reinvent the wheel.