6
votes

My (web) app is set up as follows: users upload PDF files, I run OCR on them, and I show them the OCRed PDF. Since everything is online, minimizing the size of the resulting PDF is key to reducing loading and wait times for the user.

The file I receive from the user is sample.pdf (I've created an archive with the original files as well as those that I generate here: https://dl.dropboxusercontent.com/u/1390155/tess-files/sample.zip). I use tesseract 3.04 and do the following:

gs -r300 -sDEVICE=tiff24nc -dBATCH -dNOPAUSE -sOutputFile=sample.tiff sample.pdf
tesseract sample.tiff sample-tess -l fra -psm 1 pdf

The OCR result is good, but the generated PDF is now about 2.5 times the size of the original:

  • size of original PDF file: 60K
  • size of final PDF: 147K
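
To make this comparison repeatable for the other variants I tried, the ratio can be computed with a small shell helper. This is only a sketch; `size_ratio` is a hypothetical name of my own, not part of any tool mentioned here.

```shell
#!/bin/sh
# size_ratio ORIGINAL GENERATED
# Prints the size of GENERATED divided by the size of ORIGINAL, to two decimals.
# (size_ratio is a hypothetical helper name used only for illustration.)
size_ratio() {
    orig_bytes=$(wc -c < "$1")
    new_bytes=$(wc -c < "$2")
    awk -v o="$orig_bytes" -v n="$new_bytes" 'BEGIN { printf "%.2f\n", n / o }'
}
```

Calling `size_ratio sample.pdf sample-tess.pdf` after sourcing the snippet prints the growth factor for the Tesseract output.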

So I ask you, how can I reduce the size of the generated PDF while keeping the OCR result?

One obvious solution is to reduce the resolution when generating the tiff, but I don’t want to do that as it may affect the OCR result.

The second thing I tried was to reduce the PDF size post-tesseract, using ghostscript:

gs -o sample-down-300.pdf   -sDEVICE=pdfwrite   -dDownsampleColorImages=true \
   -dDownsampleGrayImages=true   -dDownsampleMonoImages=true  \
   -dColorImageResolution=300   -dGrayImageResolution=300  \
   -dMonoImageResolution=300   -dColorImageDownsampleThreshold=1.0  \
   -dGrayImageDownsampleThreshold=1.5   -dMonoImageDownsampleThreshold=1.0 \
    sample-tess.pdf 

This helps a bit, the generated file is only 101K, so about 1.5 times the original. I could live with that, but it also seems to affect the OCR result. For example, the white space between ‘RESTAURANT’ and ‘PIZZERIA’ (second line) is now missing.

Another (simpler) option with ghostscript, using the /ebook preset, results in a 43K file with somewhat lower image quality and the same problem of missing white space:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
    -dNOPAUSE -dBATCH  -dQUIET -sOutputFile=sample-ebook.pdf \
     sample-tess.pdf

The lesser quality of the PDF is fine, but again, I don’t really want to compromise on the OCR.

I’ve done other tests using PNG and JPEGs, but the OCR results always go down (even slightly) and the resulting PDF is not smaller. For example, with PNG:

convert -density 300 sample.pdf -transparent white sample.png
tesseract sample.png sample-tess-png -l fra -psm 1 pdf

The total (55.50) is missing and the final PDF size is 149k.

So to summarize, here are my questions:

  • Can someone explain why reducing the size of the PDF using ghostscript affects the OCR result? I thought the text layer and the image layer were independent...
  • Are there options that one can give to tesseract to reduce the quality of the images when it generates the PDF?
  • I read that other solutions like ABBYY OCR use Mixed Rasterized Content (MRC) to reduce the file size. Does tesseract do that already? If not, are there some open source or proprietary CLI tools that do that, which I could use to reduce the tesseract generated PDF file?

Again, I’m OK compromising on the quality of the PDF images (although I would like to keep the colors, ideally) as long as the user can search text and select it to copy/paste from the PDF.

Any help greatly appreciated!

3
You're generating tiff24nc files. Did you also try with tiffg4 and compare the results? – Kurt Pfeifle
I opened up a new issue to implement the feature you are looking for in a tool which I wrote which is a wrapper around tesseract. Hopefully I can get to it soon. Here it is: github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/issues/5. – Gabriel Staples

3 Answers

1
votes

Problem 1, I can't see any file 'attached' to this, so I'm guessing in the dark.

There is no 'text layer' or 'image layer' in PDF. PDF may have layers, but that is an independent feature. Text and images are embedded in the file 'as is'. Of course, rendering the PDF to a TIFF image does produce a single image file.

The original PDF will have the text stored as text, using fonts; the TIFF file will have the whole lot rendered as an image. I am unsure exactly how tesseract works, and without an example of its output I can't be certain, but I expect that what it does is leave the rendered image intact in the output PDF file, and add text using render mode 3 (neither stroke nor fill, i.e. invisible). This is what you have described as 'MRC' above.

What this means for you is that the original PDF is small, because much (perhaps all) of the content is described as vector data. The resulting TIFF file is large because it's a full-page bitmap; the savings gained by using a vector representation have been lost. This is then converted to a PDF (so still large) and then more text and fonts are added to the document, which of course only increases its size.

The only thing which is going to make a substantial difference to the size of that file, realistically, is to reduce the size of the bitmap image, i.e. the TIFF file which you use to create the final output PDF.
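
One way to shrink the bitmap without lowering the resolution is the one suggested in the comments on the question: render with Ghostscript's `tiffg4` device (1-bit, CCITT Group 4) instead of `tiff24nc`. The sketch below is untested against the sample file, and it assumes the pages are acceptable in black and white, since all colour is lost. `ocr_bilevel` and the output base name are hypothetical; the `gs` and `tesseract` options are the ones already used in the question, apart from the device name and `-dQUIET`.

```shell
#!/bin/sh
# ocr_bilevel INPUT.pdf OUTPUT_BASE [LANG]
# Hypothetical wrapper: renders INPUT.pdf to a bilevel TIFF, then OCRs it.
ocr_bilevel() {
    input=$1
    base=$2
    lang=${3:-fra}   # default language taken from the question

    # Same 300 dpi as before, but 1-bit Group 4 instead of 24-bit colour.
    gs -r300 -sDEVICE=tiffg4 -dBATCH -dNOPAUSE -dQUIET \
       -sOutputFile="$base.tiff" "$input" || return 1

    # Same Tesseract 3.04 invocation as in the question; writes "$base.pdf".
    tesseract "$base.tiff" "$base" -l "$lang" -psm 1 pdf
}
```

For example, `ocr_bilevel sample.pdf sample-g4` would produce `sample-g4.tiff` and `sample-g4.pdf`, whose size and OCR output can then be compared with the `tiff24nc` run.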

Messing with the original PDF file before rendering to TIFF and OCR seems unlikely to make any difference to the final PDF file size (caveat; compression may work better because there may be more areas of 'flat' colour)

Without seeing the original file and the final file I can't really say much more, and I'm not in a position to test it myself (I don't have Tesseract installed) but it seems to me that the only real solution is to have Tesseract downsample the image before creating the final output PDF file.

1
votes

Firstly, Tesseract is an OCR engine. You can't expect any of its functions other than OCR to be optimized. It does OCR very well, not the other stuff. It does do other stuff, for example it thresholds whatever image you give it if not already thresholded (using Otsu's method), but you'd get better results by thresholding the image yourself first and then passing it to Tesseract, assuming you have an idea about what you're giving it.

None of this is a Tesseract issue. The reason the whitespace is changing is due to the PDF viewer guessing at the word/line spaces since these are not encoded. If the text is the same and the spacing is disturbed it's entirely a PDF viewer issue. The reason it's changing between PDFs is because you're changing the resolution/canvas size and that interferes with the word/line spacing calculations by the PDF viewer. To compare you can look at the content object for any of the pages in Adobe Acrobat, it's under Preflight | Options | Browse Internal PDF Structure.
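
If you don't have Acrobat, a rough command-line check is to extract the text from both PDFs and diff it. The sketch below assumes Poppler's `pdftotext` is installed; `text_layer_diff` is a hypothetical helper name. Bear in mind that `pdftotext` also has to infer word spacing from glyph positions, so this shows what one particular extractor sees, not what is literally stored in the file.

```shell
#!/bin/sh
# text_layer_diff A.pdf B.pdf
# Hypothetical helper: diffs the text that pdftotext extracts from two PDFs.
# Returns diff's exit status (0 = identical text, 1 = differences).
text_layer_diff() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || return 2

    # -layout asks pdftotext to preserve the physical layout as far as it can.
    pdftotext -layout "$1" "$tmp_a"
    pdftotext -layout "$2" "$tmp_b"

    status=0
    diff "$tmp_a" "$tmp_b" || status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

Running `text_layer_diff sample-tess.pdf sample-down-300.pdf` would show whether the space between 'RESTAURANT' and 'PIZZERIA' is also lost for this extractor.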

The first question I would ask is why the images in the PDF are modified at all? Surely they should not be, they should be exactly the same images you started with, just with the text layer (yes text layer, it's text and it's layered over the image = text layer) inserted invisibly over the top. You can use "Browse Internal PDF Structure" (or Notepad) to look at the size of any of the image objects and see if they are the same size. If not you want to stop them from being changed, or you want to save them and then replace them in the final PDF.
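
As an alternative to browsing the structure by hand, Poppler's `pdfimages -list` prints one row per embedded image. This is a sketch assuming Poppler is installed; `list_pdf_images` is a hypothetical helper name, and the exact columns (dimensions, colour space, encoding and so on) depend on the Poppler version.

```shell
#!/bin/sh
# list_pdf_images FILE.pdf [FILE.pdf ...]
# Hypothetical helper: prints the embedded-image table for each PDF so the
# image objects can be compared before and after post-processing.
list_pdf_images() {
    for pdf in "$@"; do
        printf '== %s ==\n' "$pdf"
        pdfimages -list "$pdf"
    done
}
```

Calling `list_pdf_images sample-tess.pdf sample-down-300.pdf sample-ebook.pdf` puts the three tables one after another, which makes it easy to see whether Ghostscript re-encoded or resampled the page image.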

Otherwise perhaps the text is not compressed. PDF supports Deflate. No doubt there's a setting in Ghostscript or PDFTK to compress all the content objects.

You should certainly not have to reduce the quality of the images in the PDF. If I was one of your users/customers I don't think I'd be happy that what you gave me back was not the same as what I gave you - that would make your service useless.

1
votes

Since you use Tesseract 3.04, it supports various compression modes that you may want to check out.

  --force-transcode=[true|false]
  --force-lossless=[true|false]
  --force-compression-algorithms=[dct|flate|g4|lzw|jpx|jbig2]

Issue 1285, 1300.