4
votes

Problem

I am trying to write Python code for image preprocessing and recognition using Tesseract-OCR. My goal is to solve this form of captcha reliably.

Original captcha and result of each preprocessing step

Steps as of Now

  1. Greyscale and thresholding of image

  2. Image enhancing with PIL

  3. Convert to TIF and scale to >300px

  4. Feed it to Tesseract-OCR (whitelisting all uppercase alphabets)

However, I still get a rather incorrect reading (EPQ M Q). What other preprocessing steps can I take to improve accuracy? My code and additional captchas of a similar nature are appended below.

similar captchas I want to solve

Code

import cv2
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'):
     im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE)
     # 85 was picked by hand for these captchas; tune it for your images
     (thresh, im_bw) = cv2.threshold(im_gray, 85, 255, cv2.THRESH_BINARY)
     cv2.imwrite(binary_image_path, im_bw)

     return binary_image_path

def preprocess_image_using_opencv(captcha_path):
     bin_image_path = binarize_image_using_opencv(captcha_path)

     im_bin = Image.open(bin_image_path)
     basewidth = 300  # in pixels
     wpercent = (basewidth/float(im_bin.size[0]))
     hsize = int((float(im_bin.size[1])*float(wpercent)))
     big = im_bin.resize((basewidth, hsize), Image.NEAREST)

     # save the scaled-up image as TIF for tesseract
     tif_file = "input-NEAREST.tif"
     big.save(tif_file)

     return tif_file

def get_captcha_text_from_captcha_image(captcha_path):

     # Preprocess the image before OCR
     tif_file = preprocess_image_using_opencv(captcha_path)
     return tif_file



get_captcha_text_from_captcha_image("path/captcha.png")

im = Image.open("input-NEAREST.tif") # the second one 
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('captchafinal.tif')
text = pytesseract.image_to_string(
     Image.open('captchafinal.tif'),
     config="-c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 6")
print(text)

1 Answer

2
votes

The major problem comes from the different orientations of the letters, not from the preprocessing stage. The preprocessing you did is the common approach and should work well, but you can replace the fixed threshold with adaptive thresholding to make your program more robust to varying image brightness.

I ran into the same problem when I was working with tesseract for car license plate recognition. From that experience I realized that tesseract is very sensitive to the orientation of the text in the image. Tesseract recognizes letters well when the text is horizontal: the more horizontally oriented the text, the better the result you get.

So you have to create an algorithm that detects each letter in your captcha image, estimates its orientation, and rotates it until it is horizontal; then apply your preprocessing, run tesseract on the rotated horizontal patch, and append its output to your result string. Then move on to the next letter and repeat the process. You will also need an image transformation function to rotate the letters, and you will have to think about finding the corners of the detected letters. Maybe this project will help you, because they rotate text in the image to improve tesseract quality.