Is there a way to force Tesseract to do OCR only and leave the original images intact? At the moment, I use the command:

tesseract -l eng file.tif file pdf

in order to produce file.pdf from a multipage TIFF file. My problem with this command is that Tesseract modifies the images: for example, thin lines that delimit tables or appear in figures are removed. I'd like to stop this behavior and have Tesseract only OCR the document, with the recognized text placed underneath the unmodified original image. In case it matters:

$ tesseract -v
tesseract 3.03
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0

and

$ cat /usr/share/tessdata/configs/pdf
tessedit_create_pdf 1
tessedit_pageseg_mode 1

1 Answer

With a build from the current Tesseract git repository, the resulting images look much better. Specifically:

$ ./tesseract -v
tesseract 3.04.00
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0

and

$ git log -n 1
commit 941d87057e67d18aca2ed428543e7f24bbdba010
Author: Ray Smith <[email protected]>
Date:   Wed May 13 17:46:58 2015 -0700

    Fixed training build

with

$ git branch
* master

Basically, all of the lines that 3.03 used to remove from tables and figures are now preserved. That said, the image is still modified: its resolution is lower than that of the original. Nevertheless, for my purposes the output looks OK.
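
Since the difference above comes down to which Tesseract version is installed, it may help to check the version before converting. The following is only a rough sketch: `tess_version_from` and `version_at_least` are hypothetical helper names, not part of Tesseract, and the 3.04 threshold is simply the version I tested above, not a documented cutoff.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tesseract) for checking the installed version.

# Print the version number found on the first line of `tesseract -v` output,
# e.g. "tesseract 3.04.00" -> "3.04.00".
tess_version_from() {
    printf '%s\n' "$1" | sed -n '1s/^tesseract[[:space:]]*\([0-9][0-9.]*\).*/\1/p'
}

# Succeed if version $1 is at least version $2, comparing major.minor only.
version_at_least() {
    have_major=${1%%.*}
    want_major=${2%%.*}
    case $1 in *.*) have_rest=${1#*.} ;; *) have_rest=0 ;; esac
    case $2 in *.*) want_rest=${2#*.} ;; *) want_rest=0 ;; esac
    have_minor=${have_rest%%.*}
    want_minor=${want_rest%%.*}
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -gt "$want_major" ]
        return
    fi
    [ "$have_minor" -ge "$want_minor" ]
}

# Example use (assumes tesseract is on PATH; 2>&1 in case the version
# information is written to stderr):
#
#   ver=$(tess_version_from "$(tesseract -v 2>&1)")
#   if version_at_least "$ver" 3.04; then
#       tesseract -l eng file.tif file pdf
#   else
#       echo "tesseract $ver may drop thin lines from the page images" >&2
#   fi
```

This only guards against the line-removal behavior I saw in 3.03; it does nothing about the lower resolution of the embedded images.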