How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python 2.7 to parse PDF files?

Question

I'm trying to parse a few PDF files that contain engineering drawings in order to extract the text data in them. I tried using Tika as a JAR from Python through the jnius package (following this tutorial: http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html), but the code throws an error.

Using the tika package, however, I was able to pass files in and parse them, but Python only extracts the metadata; when asked for the content, it returns "None". It parses .txt files perfectly but fails to extract content from PDFs. Here's the code:

import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('/path/to/file')
print parsed["metadata"]
print parsed["content"]

Do I need additional packages or lines of code to be able to extract the data?

Is there actually any text in your PDFs? Computers are dumb. What looks like text for you, me, and everyone else, may be just a couple of random lines to a computer. — Jongware
The text in the PDFs has been scanned in and does not exist as actual characters. Essentially it is just the labels included on a typical engineering drawing (much like this one: 7-plus-ngm.org/bilder/piston.jpg). I need to be able to extract the label data, description tables, and notes included in the example image. — Abhishek.A
Then you cannot use a general text extractor; you must use OCR here (Optical Character Recognition). — Jongware
NOTE: I tried passing PDFs that contain only text, even .doc files converted to .pdf, and the code still returns "None" for the content. So I wonder whether something is wrong with the package itself, or whether it requires other dependencies to work properly. — Abhishek.A
Apache Tika supports OCR'ing text if you have the right tools installed. Did you try following the Tika OCR setup instructions? — Gagravarr

HakunaMaData · Accepted Answer · 2016-04-14T16:16:07

You need to download the Tika Server JAR and run it first. See this link: http://wiki.apache.org/tika/TikaJAXRS

1. Download the JAR.
2. Store it somewhere and run it as java -jar tika-server-x.x.jar --port xxxx
3. In your code you no longer need tika.initVM(); add tika.TikaClientOnly = True in its place.
4. Change parsed = parser.from_file('/path/to/file') to parsed = parser.from_file('/path/to/file', '/path/to/server'). The server path is printed in step 2 when the Tika server starts; just plug it in here.
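
Put together, the steps above might look roughly like the sketch below. It assumes the Tika server from step 2 is already running; the helper names (server_endpoint, extract_pdf, has_text) are made up for illustration and are not part of tika-python, and the default port 9998 is an assumption that only holds if you did not pass a different --port. The has_text check is there because a scanned drawing with no text layer can still come back with empty content unless OCR is set up on the server side.

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too


def server_endpoint(host="localhost", port=9998):
    """Build the URL of a Tika server that is already running.

    Assumption: 9998 is used only as a fallback here; pass whatever
    host and port you started the server with in step 2.
    """
    return "http://{0}:{1}".format(host, port)


def extract_pdf(path, endpoint):
    """Send `path` to the running Tika server and return (metadata, content)."""
    # Imported lazily so this file still loads where tika is not installed.
    import tika
    tika.TikaClientOnly = True  # replaces tika.initVM(): use the existing server
    from tika import parser

    parsed = parser.from_file(path, endpoint)
    return parsed.get("metadata"), parsed.get("content")


def has_text(content):
    """Return True only when the extracted content is non-blank text."""
    return content is not None and content.strip() != ""
```

A call such as metadata, content = extract_pdf('/path/to/file', server_endpoint(port=xxxx)) followed by has_text(content) then tells you whether any text came back at all, which helps separate a setup problem from a PDF that simply has no text layer.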

Good luck!
