I have two images that are almost identical:
other.png
title.png
I use this Python script to extract the text with Tesseract:
import pytesseract
import cv2
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
def process(path):
    image = cv2.imread(path)
    image = cv2.bitwise_not(image)
    # cv2.imshow('image', image)
    # cv2.waitKey(0)
    results = pytesseract.image_to_string(image, lang='eng', config='')
    print(path, results)
process('title.png')
process('other.png')
Here is the output:
title.png ‘CP TOOL
other.png cP TOOL
I don't get the same results for the two images. Why? How can I improve the text recognition?
The images are really small, but I have no control over the system that generates them. I have tried to increase the size of the images before processing them:
factor = 4
width = int(image.shape[1] * factor)
height = int(image.shape[0] * factor)
dim = (width, height)
image = cv2.resize(image, dim, interpolation=cv2.INTER_AREA)
With this resizing, the text from these two images is extracted properly, but I have other images (not included here) that still show a similar issue (in particular, CP being recognized as cP).
I have also tried to erode/dilate the image, with no noticeable effect, but I am very new to OCR, so I am probably not doing things correctly...
Thanks!