Improve Tesseract OCR results by removing special characters

Question

I am running Tesseract OCR on PDF, image, and TIFF files stored in my database, and many of them produce a lot of garbage text. For example, one image gave the following output:

â€œâ€˜55â€œ .'Hï¬ï¬jï¬tï¬tfâ€˜Nâ€˜Dï¬‚iâ€™iisifagï¬'aï¬fï¬‚â€˜rfÃ©-wt-â€œï¬â€˜-:-'!W',ï¬‚':ï¬fm:afJuirzv-int'g-v "3.0:â€ _â€˜ l 1: v .w 

From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006

As you can see, it adds extra special characters at the start.

Is there a control parameter for removing such special characters from the output? This happens with many input files.

Note: This is not the original image; it is only part of a screenshot of the PDF that I am converting to text, and the output shown is only part of the original output.

My question is not the same as "Limit characters tesseract is looking for". That question is about ignoring everything other than letters, whereas in my case the output contains unwanted letters and numbers that I need to remove. After setting tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz I still get the wrong text he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an at the start of the output, and the numbers are removed too. Is there any way to remove the unwanted letters, special characters, and numbers that appear at the start?

Possible duplicate of Limit characters tesseract is looking for — sashoalm
@sashoalm After doing this I get he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an izromrseaver medical internal tilted sos min seaa oslaelaoie urea arses ptoosloos and the numbers have disappeared. I also want the numbers. What can be done about that, and how can I remove unwanted text such as he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an from the start? — Vibhor Bhatnagar
I also tried adding tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz0123456789@. to the config, but the output is h5g11hwvwhfvt7713fybcfuisiwiggfiwfarwtrtnifrrleiixwtfhfmjafguiuwginnve mam.u a 5 1r v .w lzromrseaver medical internal tilled. 909 797 8922 0612812016 11124 91946 p.0051006 , so the problem still exists. Can you help with this? — Vibhor Bhatnagar
@VibhorBhatnagar Is this text from the question above, "From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006", the Tesseract output apart from the leading line of garbage characters? If so, it looks like the OCR accuracy still needs to improve, as some characters are still recognized incorrectly. — thewaywewere
@thewaywewere Yes, this text is the Tesseract output, along with the leading line of garbage characters. What can be done in this case? — Vibhor Bhatnagar

Liam Liam · Accepted Answer · 2017-04-25T19:31:44

Create a config file (e.g. "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs or /usr/share/tesseract-ocr/tessdata/configs.

And add this line to the config file:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz

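List the characters out literally. As far as I know the value is read as a plain list of characters, so a range such as [a-z] would not expand, but I have not checked. Then call tesseract like this: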

tesseract input.tif output nobatch letters

That limits Tesseract to recognizing only the wanted characters.
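
As a rough sketch of the same idea without a separate config file, the whitelist can also be passed on the command line with Tesseract's -c configvar=value option. The Python helper below only assembles and runs that command. The function names build_tesseract_command and run_ocr are made up for illustration, and the whitelist (letters, digits, "@" and ".") is the one tried in the comments above, not a recommendation. It assumes a tesseract executable is on the PATH.

```python
import shutil
import subprocess

# Whitelist tried in the comments above: lowercase letters, digits, "@" and ".".
WHITELIST = "abcdefghijklmnopqrstuvwxyz0123456789@."


def build_tesseract_command(image_path, output_base, whitelist=WHITELIST):
    """Return the argument list for a Tesseract run restricted to `whitelist`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path, output_base, whitelist=WHITELIST):
    """Run Tesseract (assumed to be installed) and return the output file name."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, whitelist)
    subprocess.run(command, check=True)
    # Tesseract writes its text output to <output_base>.txt.
    return output_base + ".txt"
```

For example, run_ocr("input.tif", "output") would write the recognized text to output.txt. As the comments above show, a whitelist only changes which characters Tesseract may emit; noise in the image can still come out as whitelisted letters and digits.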
