I have been using Tesseract (version 3) on Linux to extract text from scanned PDF files. The problem is that the whole process is very slow. For example, extracting text from this 20-page document (http://www.a-pdf.com/scan-paper/a-pdf-scan-paper-doc.pdf) takes 514 seconds (8+ minutes).
To convert the PDF I use ImageMagick's convert application. Below is the set of commands that I use:
convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif
tesseract tmp.tif out.txt
Note that the 288 dpi density is required, since otherwise Tesseract fails completely to extract text from the scanned file that I tested.
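The two steps can be timed separately to see which one accounts for most of the 514 seconds. This is only a minimal sketch, assuming a POSIX shell: the `timed` helper is a hypothetical name introduced here, not part of ImageMagick or Tesseract, and the pipeline only runs if src.pdf and both tools are present.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report its wall-clock time
# (whole seconds) on stderr, passing the command's exit status through.
timed() {
    label=$1; shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s" >&2
    return $status
}

# Same two commands as above, each timed on its own.
# Skipped when the input file or either tool is missing.
if [ -f src.pdf ] && command -v convert >/dev/null && command -v tesseract >/dev/null; then
    timed convert convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif
    timed tesseract tesseract tmp.tif out.txt
fi
```

The first argument to `timed` is just a label for the report; everything after it is the command to run.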
Does anyone know how I can speed things up without affecting the quality of the result?