tesseract - how to deal with character separators

Question

How can I properly recognize text inside character separators with tesseract, either through pre-processing or through special tesseract configuration? I am especially interested in the comb type (3rd image) among the three images below:

https://i.stack.imgur.com/Jb5Qd.png
https://i.stack.imgur.com/GhzCa.png
https://i.stack.imgur.com/rI4c1.png

1) The specific image I tried to perform OCR on is shown below. The image is clear, high resolution, and free of noise. If I feed this image straight into tesseract (I tried pretty much all page segmentation modes), the output is the following:

1
11, 9;9j1 | 0,7 4142 |

As observed, the digits are correctly OCRed and appear as a subset of the extracted text. However, the separators are also recognized as characters such as "1", ",", "7", "4", and "|". The expected output is 1992 07 12.

2) I am new to image recognition. Image pre-processing is an important step before OCR. I have tried floodfill from left, bottom, and right to remove the character separators. The concept is taken from here: https://www.learnopencv.com/filling-holes-in-an-image-using-opencv-python-c/ Although this solution works for this specific image, it is definitely not a general solution. Since these character separators are common in many forms, there must be a good way to extract text.

3) I have tried googling and could not find anything solid (a lot of noise on unrelated topics) within the first 10 pages of results. My search term was "tesseract character separator". The poor results may be due to a poor choice of search terms, which may differ from what the CV community uses.

4) I have tried ABBYY FineReader, and the text is recognized without problems. However, this application is paid and closed source.

@GhostCat I have improved my post and hopefully someone can provide me some direction or suggestions. Information on the internet about recognizing characters inside character separators is severely lacking, which probably means I am doing something wrong, and that is why I am posting this question. I am not sure what other information to provide beyond what I have updated. If you have any suggestions, please post an update. — jackluo923

Dmitrii Z. · Accepted Answer · 2018-11-04T13:15:23

There are many ways to solve your problem. For example, if the lines that form your cells are connected, you can filter out large connected components using OpenCV:

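import cv2
import numpy as np

# Load as grayscale and invert so that ink is white (255) on black.
gray = cv2.imread('path_to_your/image.png', cv2.IMREAD_GRAYSCALE)
_, blackAndWhite = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY_INV)

# Label 8-connected components; the last column of stats is the area in pixels.
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, None, None, None, 8, cv2.CV_32S)
sizes = stats[1:, cv2.CC_STAT_AREA]  # skip label 0, the background
img2 = np.zeros(labels.shape, np.uint8)

# Keep only the small components (characters); large ones are the separators.
for i in range(nlabels - 1):
    if sizes[i] <= 5000:  # CHANGE THIS VALUE TO CHANGE THRESHOLD.
        img2[labels == i + 1] = 255

res = cv2.bitwise_not(img2)  # back to dark text on a white background

cv2.imshow('res.png', res)
cv2.waitKey(0)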

Other approaches include, but are not limited to, detecting letters by finding contours, applying morphological operations, and using heuristics such as the fact that letters should lie on the same line.
