
I have a PDF that I'm trying to get Tika to parse. The PDF has not been OCR'd. Tesseract is installed on my machine.

I used ImageMagick to convert file.pdf to file.tiff, so the TIFF file I am parsing is a direct conversion of the PDF.

Tika parses the TIFF with no problems, but returns "None" content for the PDF. What gives? I'm using Tika 1.14.1, Tesseract 3.03, and Leptonica 1.70.

Here is the code...

from tika import parser

# This works
print(parser.from_file('/from/file.tiff', 'http://localhost:9998/tika'))

# This returns "None" for content
print(parser.from_file('/from/file.pdf', 'http://localhost:9998/tika'))
What happens if you try the latest Apache Tika server, 1.17? - Gagravarr
@Gagravarr that's kinda my last resort... I'll try the latest server if needed, but having everything running under pip helps keep everything centralized. - Jonathan Coe
Is there any error in /tmp/tika-server.log? - Dmitrii Z.

2 Answers

2
votes

So, after some feedback from Chris Mattmann (who was wonderful, and very helpful!), I sorted out the issue.

His response:

Since Tika Python acts as a thin client to the REST server, you just need to make sure the REST server is started with a classpath configuration that sets the right flags for TesseractOCR, see here:

http://wiki.apache.org/tika/TikaOCR

While I had read this before, the issue did not click for me until later, after some further reading. Tesseract does not natively accept PDFs as input for OCR. Tika relies on Tesseract for OCR, so Tika does not OCR the PDF either, and by extension neither does tika-python.

My solution:

I combined subprocess, ImageMagick (CLI) and Tika in Python to first convert the PDF to a TIFF, and then let Tika/Tesseract perform OCR on the converted file.

Notes:

  • This process is very slow for large PDFs
  • Requires: tika-python, tesseract, imagemagick

The code:

from tika import parser
import subprocess
import os

def ConvertPDFToOCR(file):

    # First attempt: let Tika parse the PDF directly.
    meta = parser.from_file(file, 'http://localhost:9998/tika')

    # Check if parsed content is NoneType and handle accordingly.
    if "content" in meta and meta['content'] is None:

        # Run ImageMagick via subprocess (command line)
        params = ['convert', '-density', '300', file, '-depth', '8', '-strip', '-background', 'white', '-alpha', 'off', 'temp.tiff']
        subprocess.check_call(params)

        # Run Tika again on new temp.tiff file
        meta = parser.from_file('temp.tiff', 'http://localhost:9998/tika')

        # Delete the temporary file
        os.remove('temp.tiff')

    return meta.get('content')
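
The function above always writes to a fixed temp.tiff in the working directory, which can collide if two conversions run at once, and the file is left behind if Tika raises. A minimal sketch of one way around that, using the same ImageMagick options, is below. The names build_convert_command and ocr_pdf_via_tiff, and the parse_file and run parameters, are hypothetical and only there for illustration; parse_file is assumed to behave like parser.from_file.

```python
import os
import subprocess
import tempfile


def build_convert_command(pdf_path, tiff_path, density=300):
    # Same ImageMagick options as the command above, with the paths filled in.
    return ['convert', '-density', str(density), pdf_path,
            '-depth', '8', '-strip', '-background', 'white',
            '-alpha', 'off', tiff_path]


def ocr_pdf_via_tiff(pdf_path, parse_file,
                     endpoint='http://localhost:9998/tika',
                     run=subprocess.check_call):
    # parse_file is assumed to take (path, endpoint) and return a dict
    # with a 'content' key, like tika's parser.from_file.
    fd, tiff_path = tempfile.mkstemp(suffix='.tiff')
    os.close(fd)  # ImageMagick overwrites the empty placeholder file
    try:
        run(build_convert_command(pdf_path, tiff_path))
        return parse_file(tiff_path, endpoint).get('content')
    finally:
        # Remove the temporary file even if conversion or parsing fails.
        os.remove(tiff_path)
```

With tika-python installed, this would be called as ocr_pdf_via_tiff('/from/file.pdf', parser.from_file).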
1
votes

You can set the 'X-Tika-PDFextractInlineImages': 'true' header and extract text directly from the images in the PDF. No conversion is needed. It took a while to figure out, but it works perfectly.

from tika import parser

headers = {
    'X-Tika-PDFextractInlineImages': 'true',
}
parsed = parser.from_file("Citi.pdf", serverEndpoint='http://localhost:9998/rmeta/text', headers=headers)
print(parsed['content'])
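
If most of your PDFs already have a text layer, you may only want to ask for inline-image extraction when the plain parse comes back empty. A rough sketch of that fallback is below. The function name extract_pdf_text and its parse_file parameter are hypothetical; parse_file is assumed to behave like parser.from_file, and the default endpoint is the one from the question, so adjust it to whatever your server uses.

```python
def extract_pdf_text(path, parse_file, endpoint='http://localhost:9998/tika'):
    # First try a plain parse; this is enough for PDFs with a text layer.
    parsed = parse_file(path, endpoint)
    if parsed.get('content'):
        return parsed['content']

    # Nothing came back, so retry with inline-image extraction enabled,
    # using the header shown above.
    headers = {'X-Tika-PDFextractInlineImages': 'true'}
    parsed = parse_file(path, endpoint, headers=headers)
    return parsed.get('content')
```

Usage would look like extract_pdf_text("Citi.pdf", parser.from_file). It still returns None if neither attempt yields any text.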