I tried to improve the results of open-source OCR software. I'm using tesseract because I find it still produces better results than gocr, but it has huge problems with poor-quality input. So I tried to preprocess the image with various tools I found on the internet:
- unpaper
- Fred's ImageMagick Scripts: TEXTCLEANER
- manually using GIMP
But I was not able to get good results with this bad test document (really just for testing, I don't need the content of this file): http://9gag.com/gag/aBrG8w2/employee-handbook
This online service works surprisingly well with this test document: http://www.onlineocr.net/
I'm wondering if it is possible to get similar results with tesseract using smart preprocessing. Are the open-source OCR engines really so bad compared to commercial ones? Even Google uses tesseract to scan documents, so I was expecting more...
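To make "preprocessing" concrete: one common step before OCR is binarization, i.e. turning a grayscale scan into pure black and white so that uneven background does not confuse the engine. Below is a minimal sketch of Otsu's thresholding in plain Python. It assumes the image is already available as a flat list of grayscale values from 0 to 255; the function names `otsu_threshold` and `binarize` are illustrative only and are not part of tesseract, unpaper, or TEXTCLEANER. Whether this step alone is enough for a document like the one above is not something the sketch can show.

```python
def otsu_threshold(pixels):
    """Pick the gray level that maximizes between-class variance (Otsu's method).

    `pixels` is assumed to be a flat list of ints in the range 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # no light pixels left
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to black (0) or white (255) around the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Hypothetical toy "scan": dark text pixels on a lighter, uneven background.
    sample = [40] * 50 + [60] * 50 + [200] * 100 + [220] * 100
    t = otsu_threshold(sample)
    print("threshold:", t)
    print("black pixels:", binarize(sample, t).count(0))
```

The binarized result would then be written back to an image file and passed to tesseract as usual. A global threshold like this tends to struggle with strongly uneven lighting, where a locally adaptive threshold is usually the next thing to try.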