I am using Ghostscript and Tesseract to extract text from scanned PDFs, but the OCR result for some parts of the PDF is not accurate. For testing purposes, I am taking screenshots of the PDF and passing them to Tesseract. Below are the scenarios and the problem I'm facing.
Scenario 1:
Link to Screenshot: https://dl.dropbox.com/u/9409594/scenario_1.tif
When I pass this image (a screenshot of the PDF at 125% zoom) to Tesseract, this is the text I get:
ART\CLE STANDARD NUMBER PFUCE
Scenario 2:
Link to screenshot: https://dl.dropbox.com/u/9409594/scenario_2.tif
If I pass the above screenshot (300% zoom) to Tesseract, the result is good:
ARTICLE NUMBER
Below are the arguments I'm using with Ghostscript and Tesseract:
Ghostscript: gswin64.exe -dNOPAUSE -dBATCH -dSAFER -sDEVICE=tifflzw -r600 -sOutputFile="C:\test\output.tiff" "C:\test\input.pdf"
Tesseract: tesseract.exe "c:\test\output.tiff" "c:\test\output.html" -l eng -psm 6 hocr
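For anyone who wants to reproduce this, the two steps can be scripted so that the resolution is easy to vary. This is only a minimal sketch in Python: the function names (gs_command, tesseract_command, pixels_at, run_pipeline) are my own and purely illustrative, and the flags are the same ones as in the commands above. Ghostscript's -r option sets the rendering resolution in DPI, and PDF units are 1/72 inch, so a larger -r yields a proportionally larger image.

```python
import subprocess


def gs_command(pdf_path, tiff_path, dpi=600, gs_exe="gswin64.exe"):
    """Build the Ghostscript command that renders a PDF to an LZW TIFF.

    The helper name and defaults are illustrative; the flags mirror the
    command line quoted in this post.
    """
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        "-sDEVICE=tifflzw",
        f"-r{dpi}",  # rendering resolution in DPI
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


def tesseract_command(tiff_path, out_base, lang="eng", psm=6,
                      tess_exe="tesseract.exe"):
    """Build the Tesseract command that writes hOCR output.

    "-psm" is the spelling used in this post; newer Tesseract releases
    spell the same option "--psm".
    """
    return [tess_exe, tiff_path, out_base, "-l", lang, "-psm", str(psm), "hocr"]


def pixels_at(points, dpi):
    """Convert a length in PDF points (1/72 inch) to pixels at a given DPI."""
    return round(points / 72 * dpi)


def run_pipeline(pdf_path, tiff_path, out_base, dpi=600):
    """Render the PDF, then OCR the resulting image (needs both tools installed)."""
    subprocess.run(gs_command(pdf_path, tiff_path, dpi), check=True)
    subprocess.run(tesseract_command(tiff_path, out_base), check=True)
```

For example, pixels_at(612, 600) gives 5100, the pixel width of a US Letter page (612 pt wide) rendered with -r600. Calling run_pipeline with different dpi values is one way to compare OCR quality across resolutions without taking screenshots by hand.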
From my testing, it seems that Tesseract gives good results when it is passed a zoomed-in version of the image. Can I enlarge the page with Ghostscript when converting it to an image? Or is there a better way to do this?
Appreciate your time and help!