3
votes

How can text inside character separators be recognized properly with Tesseract, either through pre-processing or through a special Tesseract configuration? The comb type (3rd image) is of particular interest. The three images below show examples:

https://i.stack.imgur.com/Jb5Qd.png
https://i.stack.imgur.com/GhzCa.png
https://i.stack.imgur.com/rI4c1.png

1) The specific image I tried to perform OCR on is shown below. The image is clear, high resolution, and free of noise. If I feed this image straight into Tesseract (I tried pretty much all page segmentation modes), the output is the following:

1
11, 9;9j1 | 0,7 4142 |

As observed, the digits are correctly recognized and appear as a subset of the extracted text. However, the separators are also recognized, as "1", ",", "7", "4", and "|". The expected output is 1992 07 12.

2) I am new to image recognition. Image pre-processing is an important step before OCR. I have tried floodfill from left, bottom, and right to remove the character separators. The concept is taken from here: https://www.learnopencv.com/filling-holes-in-an-image-using-opencv-python-c/ Although this solution works for this specific image, it is definitely not a general solution. Since these character separators are common in many forms, there must be a good way to extract text.

3) I have tried googling and could not find anything solid (a lot of noise on unrelated topics) within the first 10 pages of results. My search term was "tesseract character separator". The poor results may be due to a poor choice of search term, which may differ from the terminology the CV community uses.

4) I have tried ABBYY FineReader, and the text is recognized without problems. However, this application is paid and closed source.

1
@GhostCat I have improved my post and hopefully someone can provide me with some direction or suggestions. Information on the internet about recognizing characters inside character separators is severely lacking, which probably means I am doing something wrong, and that is why I am posting this question. I am not sure what other information to provide beyond what I have updated. If you have any suggestions, please post an update. - jackluo923
I think it looks better now! Good luck! - GhostCat
Shouldn't it be "1991 07 12"? - bballdave025

1 Answer

4
votes

There are many ways to solve your problem. For example, if the lines that form your cells are connected, you can filter out large connected components using OpenCV.

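import cv2
import numpy as np

# Load as grayscale and invert so that ink is white (255) on black (0).
gray = cv2.imread('path_to_your/image.png', cv2.IMREAD_GRAYSCALE)
_, blackAndWhite = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY_INV)

# Label the 8-connected components; label 0 is the background.
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, connectivity=8, ltype=cv2.CV_32S)
sizes = stats[1:, cv2.CC_STAT_AREA]  # pixel area of each non-background component
img2 = np.zeros(labels.shape, np.uint8)

# Keep only the small components (the characters); the large separator grid is dropped.
for i in range(nlabels - 1):
    if sizes[i] <= 5000:   # CHANGE THIS VALUE TO CHANGE THRESHOLD.
        img2[labels == i + 1] = 255

# Invert back to black text on a white background.
res = cv2.bitwise_not(img2)

cv2.imshow('res.png', res)
cv2.waitKey(0)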


Other approaches include, but are not limited to, detecting letters by finding contours, applying morphological operations, or using heuristics such as the fact that letters should lie on the same line.