2 votes

I have a series of images, each containing a word. Instead of running pytesseract OCR on all of the images separately (which works fine), I would like to compile the images into one large image and run pytesseract OCR on that (to lower runtime).

What is the best way to format the images to get the best results? (ie: should they be lined up horizontally, vertically, jumbled, etc.)

Also, what would be the best page segmentation mode?

I have tried horizontally concatenating the images and then using PSM 7 (treating the image as a single line of text), however, this did not produce results as good as running pytesseract OCR on each individual word image using PSM 8 (treating the image as a single word).
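For reference, a minimal sketch of the two setups I am comparing (the file names are placeholders):

```python
from PIL import Image
import pytesseract

paths = ["word1.png", "word2.png", "word3.png"]  # hypothetical file names
images = [Image.open(p) for p in paths]

# Approach 1: one pytesseract call per word image with PSM 8 (single word).
# Accurate, but each call starts a new tesseract process.
words_individual = [pytesseract.image_to_string(im, config="--psm 8").strip()
                    for im in images]

# Approach 2: concatenate the images horizontally and make one call with
# PSM 7 (single line of text). Faster, but less accurate in my tests.
total_w = sum(im.width for im in images)
max_h = max(im.height for im in images)
strip = Image.new("RGB", (total_w, max_h), "white")
x = 0
for im in images:
    strip.paste(im, (x, 0))
    x += im.width
words_combined = pytesseract.image_to_string(strip, config="--psm 7").split()
```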

Are you actually sure that combining the images into one improves runtime? I know people do that to reduce the number of API calls when using cloud OCR APIs, but you aren't paying per API call with Tesseract. – user406009
@Lalaland When I horizontally concatenated the images, the runtime decreased significantly. That said, the method was not adequate because Tesseract recognized some words incorrectly. I am assuming that if concatenation alone decreased the runtime that much, a different formatting could produce better results at a similarly decreased runtime. – Stephane Hatgis-Kessell

1 Answer

2 votes

pytesseract wraps the tesseract executable, so every call writes the image to disk and then reads tesseract's output back from disk. Each start of the tesseract executable also triggers API initialization (e.g. reading the traineddata from disk).

That is not a big problem if you are OCRing one large text/image, but if you have plenty of short text images (e.g. single words), it is a waste of time/performance. Consider using the tesseract C-API in Python via cffi or ctypes. See the recent example in the tesseract user forum.
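A minimal ctypes sketch of that idea: load the library and the traineddata once, then reuse the same handle for every word image. The shared-library name and the file names are assumptions; adjust them for your platform.

```python
import ctypes
from PIL import Image

lib = ctypes.CDLL("libtesseract.so.5")  # assumed library name; differs per OS/version
lib.TessBaseAPICreate.restype = ctypes.c_void_p
lib.TessBaseAPIInit3.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p]
lib.TessBaseAPISetPageSegMode.argtypes = [ctypes.c_void_p, ctypes.c_int]
lib.TessBaseAPISetImage.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                    ctypes.c_int, ctypes.c_int,
                                    ctypes.c_int, ctypes.c_int]
lib.TessBaseAPIGetUTF8Text.restype = ctypes.c_void_p  # char*; free with TessDeleteText
lib.TessBaseAPIGetUTF8Text.argtypes = [ctypes.c_void_p]
lib.TessDeleteText.argtypes = [ctypes.c_void_p]
lib.TessBaseAPIEnd.argtypes = [ctypes.c_void_p]
lib.TessBaseAPIDelete.argtypes = [ctypes.c_void_p]

api = lib.TessBaseAPICreate()
lib.TessBaseAPIInit3(api, None, b"eng")  # traineddata is read once, here
lib.TessBaseAPISetPageSegMode(api, 8)    # PSM 8: treat each image as a single word

def ocr_word(path):
    # Feed raw RGB pixels directly; no temp files are written to disk.
    img = Image.open(path).convert("RGB")
    w, h = img.size
    lib.TessBaseAPISetImage(api, img.tobytes(), w, h, 3, 3 * w)
    ptr = lib.TessBaseAPIGetUTF8Text(api)
    text = ctypes.string_at(ptr).decode("utf-8").strip()
    lib.TessDeleteText(ptr)
    return text

words = [ocr_word(p) for p in ["word1.png", "word2.png"]]  # hypothetical file names
lib.TessBaseAPIEnd(api)
lib.TessBaseAPIDelete(api)
print(words)
```

The key point is that initialization happens once, while SetImage/GetUTF8Text are called per word, so the per-image cost drops to the recognition itself.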