I am running tesseract OCR on some PDF, image, and TIFF files saved in my DB. While doing so, I am getting a lot of garbage text in the output for various files. For example, in this case the image gave me the following text output:
“‘55“ .'Hï¬ï¬jï¬tï¬tf‘N‘Dfli’iisifagï¬'aï¬ffl‘rfé-wt-“ï¬â€˜-:-'!W',fl':ï¬fm:afJuirzv-int'g-v "3.0:†_‘ l 1: v .w
From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006
As you can see, it adds extra special characters at the start of the output.
I just want to know if there is any control parameter for removing such special characters from the output, because this is happening with many input files.
Note: This is not the original image; it is only part of a screenshot of the PDF that I am converting to text, and the output shown is only part of the original output.
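A possible post-processing workaround, sketched here only as an illustration: drop any output line in which too many characters are neither ASCII letters nor digits. This is not a tesseract parameter. The function names and the 0.3 threshold are assumptions made for this sketch, and the threshold would need tuning against real documents.

```python
def junk_ratio(line: str) -> float:
    """Fraction of non-space characters that are not ASCII letters or digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    junk = sum(1 for c in chars if not (c.isascii() and c.isalnum()))
    return junk / len(chars)


def drop_garbage_lines(text: str, max_junk_ratio: float = 0.3) -> str:
    """Keep only lines whose junk ratio is at or below the (assumed) threshold."""
    kept = [ln for ln in text.splitlines() if junk_ratio(ln) <= max_junk_ratio]
    return "\n".join(kept)


if __name__ == "__main__":
    # ASCII-only excerpt of the garbage line, followed by the good line.
    ocr_output = (
        "-:-'!W',fl':afJuirzv-int'g-v \"3.0: _ l 1: v .w\n"
        "From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006"
    )
    print(drop_garbage_lines(ocr_output))
```

A heuristic like this only helps while the garbage is mostly punctuation. Once a whitelist forces the noise into letters and digits, the ratio no longer separates good lines from bad ones.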
My question is not the same as "Limit characters tesseract is looking for", because that question is about ignoring everything other than letters, whereas in my case there are unwanted letters and numbers in the output text that I need to remove.
After using tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz I am still getting wrong text
he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an
at the start of the output text, and it also removes the numbers. So I just want to ask whether there is any way of removing the unwanted letters, special characters, and numbers that appear at the start.
I also put tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz0123456789@. in the config, but the output is
h5g11hwvwhfvt7713fybcfuisiwiggfiwfarwtrtnifrrleiixwtfhfmjafguiuwginnve mam.u a 5 1r v .w lzromrseaver medical internal tilled. 909 797 8922 0612812016 11124 91946 p.0051006
so the problem still exists. Can you help with this? – Vibhor Bhatnagar
From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006 - from the question above, is this the tesseract output apart from the leading line of garbage characters? If so, it looks like the OCR accuracy still needs to improve, as some characters are still recognized incorrectly. – thewaywewere
It is the tesseract output along with the leading line of garbage characters. What can be done in this case? – Vibhor Bhatnagar
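One direction that might be worth trying, sketched under assumptions: if the OCR is driven from Python through pytesseract, `image_to_data` with `Output.DICT` returns a per-word confidence alongside the text, and words below a cutoff can be dropped. The helper name, the cutoff of 60, and the sample values below are hypothetical. Garbage words are not guaranteed to score low, so this is a filter to experiment with, not a fix.

```python
def keep_confident_words(data: dict, min_conf: float = 60.0) -> str:
    """Join the words whose confidence is at least min_conf.

    `data` is expected to have parallel "text" and "conf" lists, the shape
    returned by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may be a str, int, or float depending on the pytesseract
        # version; boxes that are not words carry -1.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return " ".join(words)


# Obtaining the data (requires pytesseract and Pillow; "page.png" is a
# hypothetical file name):
#   import pytesseract
#   from PIL import Image
#   data = pytesseract.image_to_data(
#       Image.open("page.png"), output_type=pytesseract.Output.DICT)

if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    sample = {
        "text": ["", "fhawfyh", "From:Beaver", "Medical"],
        "conf": ["-1", "12", "91", "88"],
    }
    print(keep_confident_words(sample))
```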