Is there a way to force Tesseract to do OCR only and leave the original images intact? At the moment, I use the command:
tesseract -l eng file.tif file pdf
to produce file.pdf from a multipage TIFF file. My problem with this command is that Tesseract modifies the images. For example, thin lines that outline tables or parts of some figures are removed. I'd like to stop this behavior and have Tesseract only OCR the document, so that the recognized text is underlaid on the original, unmodified image. In case it matters,
$ tesseract -v
tesseract 3.03
leptonica-1.71
libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0
and
$ cat /usr/share/tessdata/configs/pdf
tessedit_create_pdf 1
tessedit_pageseg_mode 1
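
To make the setup easier to reproduce, here is a minimal Python sketch of what the invocation above amounts to. The helper names (`parse_tesseract_config` and `build_command`) are made up for illustration and are not part of Tesseract. I am also assuming that config files are plain `name value` lines, with lines starting with `#` treated as comments. It only parses the two settings shown above and rebuilds the argument list; it does not change how the images are handled.

```python
# Contents of /usr/share/tessdata/configs/pdf as shown above.
PDF_CONFIG = """tessedit_create_pdf 1
tessedit_pageseg_mode 1
"""


def parse_tesseract_config(text):
    """Parse 'name value' lines into a dict (hypothetical helper).

    Assumption: blank lines and lines starting with '#' carry no settings.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params


def build_command(image, output_base, lang="eng", config="pdf"):
    """Rebuild the argument list for the command used above (hypothetical helper)."""
    return ["tesseract", "-l", lang, image, output_base, config]


if __name__ == "__main__":
    print(parse_tesseract_config(PDF_CONFIG))
    print(" ".join(build_command("file.tif", "file")))
```

Passing the list returned by `build_command` to `subprocess.run(..., check=True)` should be equivalent to the command I run by hand. The parsed settings show that the `pdf` config only sets the two variables listed above.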