I have been using Tesseract (version 3) on Linux to extract text from scanned PDF files. The problem is that the whole process is very slow. For example, extracting the text from this 20-page document (http://www.a-pdf.com/scan-paper/a-pdf-scan-paper-doc.pdf) takes 514 seconds (more than 8 minutes, or roughly 26 seconds per page).

To convert the PDF to an image I use ImageMagick's convert tool. Below are the commands that I use:

convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif

tesseract tmp.tif out.txt

Note that the 288 dpi density is required; at lower values Tesseract fails completely to extract text from the scanned file that I tested.
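
To see where the time goes, the same two steps can be run one page at a time. The following is only a minimal sketch: the function names `ocr_page_cmd` and `ocr_all_cmds` and the `page-N` file names are made up for illustration, and the page count has to be supplied by hand. It relies on ImageMagick's `file.pdf[N]` syntax for selecting a single (zero-based) page and on Tesseract adding the `.txt` extension to the output base name itself. The functions only print the commands, so nothing runs until the output is passed to a shell.

```shell
#!/bin/sh
# Sketch: print the convert + tesseract pipeline for each page separately.
# ocr_page_cmd / ocr_all_cmds and the page-N names are hypothetical helpers.

ocr_page_cmd() {
    # $1 = source PDF, $2 = zero-based page index
    pdf=$1
    page=$2
    # 'file.pdf[N]' asks ImageMagick for page N only; the quotes keep the
    # shell from treating the brackets as a glob pattern.
    # tesseract writes page-N.txt because it appends .txt to the output base.
    printf "convert -density 288 '%s[%s]' -colorspace Gray -depth 8 -alpha off page-%s.tif && tesseract page-%s.tif page-%s\n" \
        "$pdf" "$page" "$page" "$page" "$page"
}

ocr_all_cmds() {
    # $1 = source PDF, $2 = number of pages in it
    i=0
    while [ "$i" -lt "$2" ]; do
        ocr_page_cmd "$1" "$i"
        i=$((i + 1))
    done
}
```

Running `ocr_all_cmds src.pdf 20` prints one command line per page; piping that output to `sh` executes the pages in sequence, and a single printed line can be prefixed with `time` to measure one page on its own.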

Does anyone know how I can speed things up without affecting the quality of the result?

1 Answer

Try VietOCR to see whether it gives you faster results. It can accept PDF input if Ghostscript is installed.