I have two images that are almost identical:

other.png

title.png

I use this Python script to extract the text with Tesseract:

import pytesseract
import cv2

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def process(path):
    image = cv2.imread(path)
    image = cv2.bitwise_not(image)
    # cv2.imshow('image', image)
    # cv2.waitKey(0)
    results = pytesseract.image_to_string(image, lang='eng', config='')
    print(path, results)

process('title.png')
process('other.png')

Here is the output:

title.png ‘CP TOOL
other.png cP TOOL

I don't get the same results. Why? How can I improve the text recognition?

The images are really small but I have no control over the system that generates the images. I have tried to increase the sizes of the images before processing them:

    factor = 4
    width = int(image.shape[1] * factor)
    height = int(image.shape[0] * factor)
    dim = (width, height)
    image = cv2.resize(image, dim, interpolation=cv2.INTER_AREA)

With this change, the text from these two images is extracted properly, but I have other images (not included here) that still show a similar issue (in particular, CP being recognized as cP).

I have tried to erode/dilate the image, with no noticeable effect, but I am very new to OCR so I am probably not doing things correctly...

Thanks!

1 Answer

OCR systems are not perfect, but there are several things you can do to improve the results, depending on your use case:

  • You can improve the input image quality before passing it to Tesseract
  • You can change the config passed to the image_to_string function
  • You can retrain Tesseract for new fonts
  • You can try another OCR system
  • You can train your own custom computer vision model

I recommend checking the Tesseract documentation at https://github.com/tesseract-ocr/tessdoc for more information about improving quality, config options, and retraining Tesseract.