This question is a follow-up to an earlier, answered question. I'm using Tesseract with Python to read some dates from small images. The solution provided there worked for most cases, but I just found out that it is not able to read the character "5".

This is the raw image I'm working with:

enter image description here

Following the advice provided in the former question I have pre-processed the image to get this one:

enter image description here

It looks nice, but Tesseract is still not able to read the first "5". It produces "o MAY 2021" instead of "5 MAY 2021".

How can I fine-tune Tesseract, either via parameters or image pre-processing, to get the correct reading?

1 Answer

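Since the image is small, I first added a border around it and resized it to twice its size. Then I converted it to grayscale and binarized it with Otsu's threshold, because Tesseract tends to give more accurate output with binarized images. Finally, I inverted the binary image before passing it to Tesseract.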

>>> import cv2
>>> import pytesseract
>>> img = cv2.imread("5.jpg")
>>> # Pad the image with a 50-pixel black border on every side
>>> img = cv2.copyMakeBorder(img, 50, 50, 50, 50, cv2.BORDER_CONSTANT, value=[0, 0, 0])
>>> # Upscale by a factor of 2 in both directions
>>> img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
>>> gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>>> # Binarize with Otsu's threshold, then invert black and white
>>> otsu = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>>> otsu = 255 - otsu
>>> pytesseract.image_to_string(otsu)
'5 MAY 2021\n\x0c'
>>> print(pytesseract.image_to_string(otsu))
5 MAY 2021
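
To make the binarization step less of a black box, here is a minimal pure-Python sketch of what Otsu's method does: it picks the threshold that maximizes the variance between the dark and the bright pixel classes. The helper names `otsu_threshold` and `binarize_inverted` and the sample pixel values are made up for illustration, and this is not OpenCV's actual implementation, so the threshold it picks may differ slightly from the one `cv2.threshold` returns.

```python
def otsu_threshold(pixels):
    """Return an 8-bit threshold that maximizes the between-class variance."""
    # Histogram of intensities 0..255.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of their intensities
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels in the dark class yet
        w_bright = total - w_dark
        if w_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize_inverted(pixels, t):
    """Map pixels above t to 0 and the rest to 255 (binary threshold, then invert)."""
    return [0 if p > t else 255 for p in pixels]


# Hypothetical flattened grayscale image: three dark and three bright pixels.
sample = [10, 12, 11, 200, 205, 198]
t = otsu_threshold(sample)
binary = binarize_inverted(sample, t)
```

With two well-separated groups of intensities like these, any threshold between the groups separates them equally well; the sketch simply keeps the first one it finds.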