How can I properly recognize text enclosed in character separators with Tesseract, either through pre-processing or through a special Tesseract configuration? I am especially interested in the comb type (third image). The three images below show the kinds of separators I mean:
https://i.stack.imgur.com/Jb5Qd.png
https://i.stack.imgur.com/GhzCa.png
https://i.stack.imgur.com/rI4c1.png
1) The specific image I tried to perform OCR on is shown below. The image is clear, high resolution, and free of noise. If I feed this image straight into Tesseract (I tried pretty much all page segmentation modes), the output is the following:
1
11, 9;9j1 | 0,7 4142 |
As observed, the digits are OCRed correctly and appear as a subset of the extracted text. However, the separators are also recognized as characters such as "1", ",", "7", "4", and "|". The expected output is 1992 07 12.
2) I am new to image recognition, but I understand that image pre-processing is an important step before OCR. I have tried flood filling from the left, bottom, and right edges to remove the character separators. The concept is taken from here: https://www.learnopencv.com/filling-holes-in-an-image-using-opencv-python-c/ Although this solution works for this specific image, it is definitely not a general solution. Since these character separators are common in many forms, there must be a good way to extract the text.
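To make the flood-fill idea concrete, here is a minimal sketch, assuming the image has already been binarized (1 = ink, 0 = background) and that the separators touch the left, bottom, or right edge of the crop while the digits do not. It is pure Python on a toy grid rather than my actual OpenCV code; the function name `remove_border_connected` and the sample `comb` grid are made up for illustration, and a real pipeline would more likely call `cv2.floodFill` on the thresholded image.

```python
from collections import deque


def remove_border_connected(img, seed_sides=("left", "bottom", "right")):
    """Erase every ink component (value 1) that touches one of the given sides.

    img is a list of equally long rows of 0/1 values; a new grid is returned
    and the input is left untouched.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]

    # Collect the border pixels used as flood-fill seeds.
    seeds = []
    if "left" in seed_sides:
        seeds += [(y, 0) for y in range(h)]
    if "right" in seed_sides:
        seeds += [(y, w - 1) for y in range(h)]
    if "top" in seed_sides:
        seeds += [(0, x) for x in range(w)]
    if "bottom" in seed_sides:
        seeds += [(h - 1, x) for x in range(w)]

    # Only seeds that sit on ink start a fill.
    queue = deque((y, x) for y, x in seeds if out[y][x] == 1)
    for y, x in queue:
        out[y][x] = 0

    # Breadth-first fill over 4-connected ink pixels.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and out[ny][nx] == 1:
                out[ny][nx] = 0
                queue.append((ny, nx))
    return out


# Toy comb: a baseline with three vertical ticks, and two "digit" strokes
# (columns 2 and 6) that do not touch the comb.
comb = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]
cleaned = remove_border_connected(comb)
for row in cleaned:
    print("".join("#" if v else "." for v in row))
```

This also shows why the approach does not generalize: any digit stroke that touches a separator belongs to the same connected component and gets erased with it, and separators that do not reach the seeded edges survive.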
3) I have tried googling and could not find anything solid (a lot of noise on unrelated topics) within the first 10 pages of results. My search term was "tesseract character separator". The poor results may be due to a poor choice of search term, which may differ from the terminology the CV community uses.
4) I have tried ABBYY FineReader, and the text is recognized without problems. However, this application is paid and closed source.