I think the page-segmentation-mode is an important factor here.
Since we are trying to read column values, we could use --psm 4 (source), which assumes a single column of text of variable sizes:
import cv2
import pytesseract
img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
txt = pytesseract.image_to_string(gry, config="--psm 4")
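We want the lines that contain #, so we first split the OCR output into lines and then keep the first two characters of each match:
txt = txt.strip().split("\n")
txt = sorted([t[:2] for t in txt if "#" in t])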
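The filtering step can also be wrapped in a small helper. This is only a sketch: the name `extract_hash_labels` and the sample string are made up for illustration, and it is slightly stricter than the one-liner above because it ignores leading whitespace and requires the line to actually start with #.

```python
def extract_hash_labels(ocr_text):
    """Return sorted two-character labels from OCR lines that start with '#'.

    Hypothetical helper: ocr_text is the raw string returned by
    pytesseract.image_to_string.
    """
    labels = []
    for line in ocr_text.strip().splitlines():
        line = line.lstrip()  # tolerate indentation in the OCR output
        if line.startswith("#"):
            labels.append(line[:2])
    return sorted(labels)


# Invented sample text, standing in for real OCR output
sample = "#7 foo\n  #3 bar\nnoise #1\n"
print(extract_hash_labels(sample))
```

Note that taking only two characters would truncate a label such as #10, so the slice would need adjusting if multi-digit labels are expected.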
Result:
['#3', '#7', '#9', '#€']
But #4 and #5 are missing. To recover them, we could apply adaptive thresholding before running the OCR.
Result:
['#3', '#4', '#5', '#7', '#9', '#€']
Unfortunately, #2 and #6 are still not recognized.
Code:
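import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("k7bqx.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Threshold each pixel against the mean of its 131x131 neighborhood minus C
thr = cv2.adaptiveThreshold(gry, 252, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, blockSize=131, C=100)

# Invert the result so dark text ends up dark on a light background
bnt = cv2.bitwise_not(thr)

# OCR in single-column mode, then keep the two-character labels containing '#'
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")
txt = sorted([t[:2] for t in txt if "#" in t])
print(txt)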
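For intuition about what the adaptive thresholding step computes, here is a pure-Python toy version of mean adaptive thresholding with inverted binary output. It is a sketch, not OpenCV's implementation: the function name and the 3x3 test image are invented for illustration, borders are handled by replicating edge pixels, and OpenCV's rounding may differ slightly.

```python
def adaptive_mean_threshold_inv(img, max_value, block_size, c):
    """Toy mean adaptive threshold (inverted binary) on a list of rows.

    A pixel becomes max_value when it is at or below
    (mean of its block_size x block_size neighborhood - c), otherwise 0.
    block_size is assumed to be odd.
    """
    h, w = len(img), len(img[0])
    r = block_size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    # Clamp coordinates to replicate the border pixels
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            local_mean = total / (block_size * block_size)
            row.append(max_value if img[y][x] <= local_mean - c else 0)
        out.append(row)
    return out


# Invented example: one dark pixel on a bright background
img = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
result = adaptive_mean_threshold_inv(img, max_value=255, block_size=3, c=10)
for row in result:
    print(row)
```

Only the dark center pixel falls far enough below its local mean to be marked, which is why a large C, as in the code above, keeps only pixels that are much darker than their surroundings.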