2
votes

I am currently working on a project that uses the Tess4j Tesseract OCR engine. While working on it I have come across many websites stating that Tesseract works best on images of at least 300 DPI (dots per inch).

My question is why DPI is mentioned so often for images. I understand that when you scan a document you want to scan it at 300 DPI or more, but I cannot figure out why this is relevant for pictures taken with a camera. As far as I know, DPI is a property of the printer: the higher it is, the smaller the printed image, but with greater quality.

Now, if DPI has nothing to do with these images, then I am wondering why the results of my program differ when I change the DPI property of images between 72 and 300. Is there a pre-processing step in Tesseract that I am unaware of?


1 Answer

5
votes

Actually, what matters is the text size at a specific DPI, that is, how many pixels tall the characters end up being. From the Tesseract FAQ:

Is there a Minimum Text Size? (It won't read screen text!)

There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#is-there-a-minimum-text-size-it-wont-read-screen-text
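
To make the quoted rule of thumb concrete, here is a minimal sketch that converts a point size and a DPI into an approximate x-height in pixels. The class name, the method names, and the x-height ratio of 0.48 are assumptions made for illustration (the ratio is chosen only because it reproduces the FAQ's "about 20 pixels at 10pt x 300dpi"); none of this is part of Tesseract or Tess4j, and real fonts vary considerably.

```java
public class XHeightEstimate {
    // Assumed ratio of x-height to font size. Roughly 0.48 reproduces the
    // FAQ's "about 20 pixels at 10pt x 300dpi"; actual fonts differ.
    public static final double ASSUMED_X_HEIGHT_RATIO = 0.48;

    /** Converts a font size in points (1 pt = 1/72 inch) to pixels at the given DPI. */
    public static double pointsToPixels(double pointSize, double dpi) {
        return pointSize / 72.0 * dpi;
    }

    /** Estimates the x-height in pixels using the assumed ratio above. */
    public static double estimateXHeightPixels(double pointSize, double dpi) {
        return pointsToPixels(pointSize, dpi) * ASSUMED_X_HEIGHT_RATIO;
    }

    public static void main(String[] args) {
        // Each pair is {point size, DPI}.
        double[][] cases = { {10, 300}, {8, 300}, {10, 72} };
        for (double[] c : cases) {
            System.out.printf("%.0fpt at %.0f dpi -> x-height of about %.1f px%n",
                    c[0], c[1], estimateXHeightPixels(c[0], c[1]));
        }
    }
}
```

Under these assumptions, 10pt text comes out at about 20 pixels of x-height at 300 DPI but under 5 pixels at 72 DPI, which is below the 8 to 10 pixel range the FAQ gives as the point where accuracy collapses. For a camera photo the same check applies directly: measure the x-height of the text in pixels, whatever DPI value the file claims.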