I have a binary image like this:

[image: the binary input image]

I want to extract the numbers in the image using Tesseract OCR in Python. I ran pytesseract on the image like this:

txt = pytesseract.image_to_string(img)

But I am not getting any good results.

What pre-processing or augmentation could help Tesseract do better?

I tried to localize the text in the image using the EAST text detector, but it was not able to find the text.

How should I proceed with this in Python?

1 Answer


I think the page segmentation mode is an important factor here.

Since we are trying to read column values, we could use --psm 4, which tells Tesseract to assume a single column of text of variable sizes (source)

import cv2
import pytesseract

img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
txt = pytesseract.image_to_string(gry, config="--psm 4")
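If it is not obvious which mode suits a given image, one option is to run the OCR once per mode and compare how many lines containing # each one returns. The sketch below is only an illustration: the helper names count_hash_labels and compare_psm_modes, and the choice of modes 4, 6 and 11, are my own assumptions and not part of pytesseract.

```python
def count_hash_labels(text):
    """Count the lines of OCR output that contain a '#' character."""
    return sum(1 for line in text.splitlines() if "#" in line)


def compare_psm_modes(image, modes=(4, 6, 11), ocr=None):
    """Run OCR once per page segmentation mode and count '#' lines for each.

    `ocr` is any callable with the shape of pytesseract.image_to_string;
    it defaults to the real function when nothing is passed in.
    """
    if ocr is None:
        # Imported lazily so the helper can be tried without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {psm: count_hash_labels(ocr(image, config=f"--psm {psm}"))
            for psm in modes}
```

With the grayscale image from above it could be called as compare_psm_modes(gry). The mode with the highest count is only a hint, since a count says nothing about whether the characters after # were read correctly.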

We want to keep only the text that starts with #

txt = txt.strip().split("\n")  # image_to_string returns one string; split it into lines first
txt = sorted([t[:2] for t in txt if "#" in t])

Result:

['#3', '#7', '#9', '#€']
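The filter above takes the first two characters of every line that contains #, which assumes the # sits at the very start of the line. A slightly more tolerant variant, sketched here with a regular expression, picks up # plus the character after it wherever it appears. The function name extract_labels and the sample string are made up for illustration and are not real OCR output.

```python
import re


def extract_labels(ocr_text):
    """Return the sorted '#x' tokens found anywhere in the OCR output."""
    # '#' followed by exactly one non-whitespace character
    return sorted(re.findall(r"#\S", ocr_text))


sample = "#7 0.52\n\n#3 0.91\nnoise\n#9 0.10\n"  # hypothetical text
print(extract_labels(sample))  # ['#3', '#7', '#9']
```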

But #4 and #5 are missing. To recover them, we could apply adaptive thresholding:

[image: result of adaptive thresholding]

Result:

['#3', '#4', '#5', '#7', '#9', '#€']

Unfortunately, #2 and #6 are not recognized.

Code:


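import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Inverted adaptive threshold: pixels well below the local mean become 252, the rest 0
thr = cv2.adaptiveThreshold(gry, 252, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, blockSize=131, C=100)

# Invert again so the locally dark pixels end up dark on a white background
bnt = cv2.bitwise_not(thr)

# OCR with --psm 4, then keep the first two characters of each line containing "#"
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")
txt = sorted([t[:2] for t in txt if "#" in t])
print(txt)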
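For intuition about what the adaptiveThreshold call in the code above computes, here is a small pure-Python sketch of mean adaptive thresholding with THRESH_BINARY_INV. It is a simplified illustration and not OpenCV's implementation: the function name is hypothetical, borders are handled by repeating the edge pixels, and rounding details may differ from the library.

```python
def adaptive_mean_threshold_inv(gray, max_value, block_size, c):
    """Simplified inverted mean adaptive threshold on a list-of-lists image."""
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be odd and at least 3")
    h, w = len(gray), len(gray[0])
    r = block_size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    # Clamp coordinates so edge pixels are repeated at the border
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += gray[yy][xx]
            # Per-pixel threshold: local mean minus the constant C
            threshold = total / (block_size * block_size) - c
            # Inverted rule: pixels above the threshold become 0, the rest max_value
            out[y][x] = 0 if gray[y][x] > threshold else max_value
    return out


# A bright 3x3 patch with one dark pixel in the middle (made-up values)
patch = [[200, 200, 200],
         [200,  20, 200],
         [200, 200, 200]]
mask = adaptive_mean_threshold_inv(patch, max_value=252, block_size=3, c=50)
print(mask)  # [[0, 0, 0], [0, 252, 0], [0, 0, 0]]
```

Only the pixel that is much darker than its neighbourhood is marked. With a large C such as 100, faint strokes that are only slightly darker than their surroundings fall above the threshold and disappear, so lowering C or changing blockSize may be worth trying for the digits that are still missed.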