I am using Ghostscript and Tesseract to extract text from scanned PDFs, but the OCR result for some parts of the PDF is not accurate. For testing purposes, I am taking screenshots of the PDF and passing them to Tesseract. Below are the scenarios and the problem I'm facing.

Scenario 1:

Link to Screenshot: https://dl.dropbox.com/u/9409594/scenario_1.tif

When I pass this image (a screenshot of the PDF at 125% zoom) to Tesseract, I get the following text:

ART\CLE STANDARD NUMBER PFUCE

Scenario 2:

Link to screenshot: https://dl.dropbox.com/u/9409594/scenario_2.tif

If I pass the above screenshot (300% zoom) to Tesseract, the result is good:

ARTICLE NUMBER

Below are the arguments I'm using with ghostscript and tesseract:

Ghostscript: gswin64.exe -dNOPAUSE -dBATCH -dSAFER -sDEVICE=tifflzw -r600 -sOutputFile="C:\test\output.tiff" "C:\test\input.pdf"

Tesseract: tesseract.exe "c:\test\output.tif" "c:\test\output.html" -l eng -psm 6 hocr

From my testing, it seems that Tesseract gives good results when it is passed a zoomed version of the image. Can I zoom the page using Ghostscript while converting it into an image? Or is there a better way to do this?

Appreciate your time and help!

Try a different output device, such as -sDEVICE=tiffgray or -sDEVICE=pnggray; these render 8-bit grayscale, whereas tifflzw is 1-bit black and white. -r300 should be good enough for most cases. - nguyenq
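
The zoom level of a screenshot and Ghostscript's -r flag both come down to the same quantity, pixels per inch. Here is a rough sketch of that arithmetic, assuming the PDF viewer draws 100% zoom at 96 pixels per inch (a common Windows default, not something stated in the question). The helper names effective_dpi and ghostscript_args are made up for illustration; the Ghostscript flags are the ones from the question, with the device and resolution from the comment above.

```python
# Assumption: the PDF viewer draws one inch of page as 96 pixels at 100% zoom.
ASSUMED_SCREEN_DPI = 96


def effective_dpi(zoom_percent, screen_dpi=ASSUMED_SCREEN_DPI):
    """Approximate resolution of a screenshot taken at the given zoom level."""
    return screen_dpi * zoom_percent / 100


def ghostscript_args(pdf_path, out_path, dpi=300, device="tiffgray"):
    """Build the Ghostscript argument list; -r sets the rendering resolution."""
    return [
        "gswin64.exe", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        f"-sDEVICE={device}",          # grayscale TIFF output
        f"-r{dpi}",                    # render at this many pixels per inch
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


if __name__ == "__main__":
    print(effective_dpi(125))  # 120.0
    print(effective_dpi(300))  # 288.0
    # Print the command rather than running it, so nothing is executed here.
    print(" ".join(ghostscript_args(r"C:\test\input.pdf", r"C:\test\output.tif")))
```

Under that assumption the 125% screenshot is roughly 120 DPI and the 300% screenshot roughly 288 DPI, which is close to -r300. So rendering the PDF directly at -r300 should give Tesseract about the same level of detail as the 300% screenshot, without the screenshot step.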

1 Answer

You can try cleaning up the image before OCR with the textcleaner script for ImageMagick: http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

You may already be aware of this, but instead of taking a screenshot you can convert the PDF to TIFF directly with ImageMagick's convert command. If it is a multi-page PDF, render the pages with pdftoppm first and then convert them to TIFF with convert.
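
A minimal sketch of that pdftoppm route, assuming pdftoppm (Poppler), ImageMagick's convert and tesseract are all on the PATH. The function names, the "page" prefix and the 300 DPI default are illustrative choices, not part of the answer. The Tesseract options are copied from the question; newer Tesseract releases spell the page segmentation option --psm.

```python
import glob
import subprocess


def pdftoppm_args(pdf_path, prefix, dpi=300):
    # -r sets the resolution in DPI; -gray writes one grayscale PGM per page
    return ["pdftoppm", "-r", str(dpi), "-gray", pdf_path, prefix]


def convert_args(src_image, dst_tif):
    # ImageMagick picks the output format from the file extension
    return ["convert", src_image, dst_tif]


def tesseract_args(image_path, out_base):
    # Same options as in the question: English, page segmentation mode 6, hOCR
    return ["tesseract", image_path, out_base, "-l", "eng", "-psm", "6", "hocr"]


def run_pipeline(pdf_path, prefix="page"):
    """Render every page, convert it to TIFF, then OCR it."""
    subprocess.run(pdftoppm_args(pdf_path, prefix), check=True)
    # pdftoppm names its output <prefix>-<page number>.pgm
    for pgm in sorted(glob.glob(prefix + "-*.pgm")):
        base = pgm[:-len(".pgm")]
        tif = base + ".tif"
        subprocess.run(convert_args(pgm, tif), check=True)
        subprocess.run(tesseract_args(tif, base), check=True)


if __name__ == "__main__":
    # Print the first command only; call run_pipeline("input.pdf") to execute.
    print(" ".join(pdftoppm_args("input.pdf", "page")))
```

Rendering at a fixed DPI this way takes the viewer zoom level out of the picture, so every page reaches Tesseract at the same resolution.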