Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned

Question

I am trying to detect these price labels text which is always clearly preprocessed. Although it can easily read the text written above it, it fails to detect price values. I am using python bindings pytesseract although it also fails to read from the CLI commands. Most of the time it tries to recognize the part where the price as one or two characters.

Sample 1:

tesseract D:\tesseract\tesseract_test_images\test.png output

And the output of the sample image is this.

je Beutel

13

However if I crop and stretch the price to look like they are seperated and are the same font size, output is just fine.

Processed image(cropped and shrinked price):

je Beutel

1,89

How do get OCR tesseract to work as I intended, as I will be going over a lot of similar images? Edit: Added more price tags:
sample2 sample3 sample4 sample5 sample6 sample7

Try come up with an algorithm which uses e.g. the cv2.connectedComponents and cv2.boundingRect functions to detect connected regions which are of dissimilar size on the same horizontal region. You can then call tesseract after either enlarging the smaller regions, shrinking the larger regions, or isolate the dissimilar regions and make the call separately. — dROOOze
Can you write down an example to how it might work? perhaps I can feed components one by one and it would still work, but connectedComponent returns a black image — NONONONONO

Shivam K. Thakkar Shivam K. Thakkar · Accepted Answer · 2018-04-19T23:19:49

The problem is the image you are using is of small size. Now when tesseract processes the image it considers '8', '9' and ',' as a single letter and thus predicts it to '3' or may consider '8' and ',' as one letter and '9' as a different letter and so produces wrong output. The image shown below explains it.

A simple solution could be increasing its size by factor of 2 or 3 or even more as per the size of your original image and then passing to tesseract so that it detects each letter individually as shown below. (Here I increased its size by factor of 2)

Bellow is a simple python script that will solve your purpose

import pytesseract
import cv2

img = cv2.imread('dKC6k.png')
img = cv2.resize(img, None, fx=2, fy=2)

data = pytesseract.image_to_string(img)
print(data)

Detected text:

je Beutel

89
1.

Now you can simply extract the required data from the text and format it as per your requirement.

data = data.replace('\n\n', '\n')
data = data.split('\n')

dollars = data[2].strip(',').strip('.')
cents = data[1]

print('{}.{}'.format(dollars, cents))

Desired Format:

1.89

Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned

2 Answers