4
votes

I'm using Apache Tika App on my Ubuntu 16.04 server as a command-line tool to extract the content of documents.

The [Apache Tika website][1] says the following:

Build artifacts

The Tika build consists of a number of components and produces the following main binaries:

tika-core/target/tika-core-*.jar Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.

tika-parsers/target/tika-parsers-*.jar Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.

tika-app/target/tika-app-*.jar Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.

So I downloaded the latest version (1.18) of tika-app-*.jar, which is just a single file.

Running it from the command line with java -jar tika-app-1.18.jar -t <filename> gives me the file content I need, but each time I also get two warnings:

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.

I don't know whether these warnings slow things down, but it is hard to follow the other output among the repetitive warnings.

I have tried to point Tika to my own configuration file by:

java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>

My tika-config.xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

With that config I get No protocol: filename.doc, and the warnings are still there.

How can I exclude the JPEG and SQLite parsers?

1
@Gagravarr Thank you. No, I hadn't read that. So based on that, I'm feeding the configuration file correctly. I can probably use `<mime-exclude>image/jpeg</mime-exclude>` to keep images from being parsed. I would probably need a default config file; do I still use the content of pom.xml? And the SQLite parser probably gets excluded the same way as images, correct? - user164863
You only need pom.xml if you are compiling Tika yourself, which you don't need to do when configuring the app! - Gagravarr
@Gagravarr OK, I get it. But when I make a config file with just the first example of how parsers can be configured and then run java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>, I get No protocol: filename.doc. Also, what is the MIME type for SQLite files? - user164863
Those warnings come at initialisation time, whereas you're excluding things at parse time. You probably just want to follow tika.apache.org/1.18/configuring.html#Load_Error_Handling to turn off the warnings. - Gagravarr

1 Answer

3
votes

My solution was this tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader loadErrorHandler="IGNORE"/>
  <service-loader initializableProblemHandler="ignore"/>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

and then set:

export TIKA_CONFIG=/path/to/tika-config.xml

in my .bashrc file.
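
For anyone who wants to try this without touching .bashrc first, here is a rough sketch of the same idea for a single shell session. It is untested against a real Tika install. It assumes that the service-loader line alone is enough to silence the two startup warnings, as the Load Error Handling section of the Tika configuration page suggests. The temporary directory and the example.doc file name are placeholders of mine, not something from my setup.

```shell
#!/bin/sh
# Sketch only: the config location and the document name are illustrative.
config_dir="$(mktemp -d)"
config="$config_dir/tika-config.xml"

# Minimal config: only asks Tika to ignore initialisation problems
# (assumption: this is what hides the J2KImageReader / sqlite-jdbc warnings).
cat > "$config" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader initializableProblemHandler="ignore"/>
</properties>
EOF

# Export for this session only; add the same line to ~/.bashrc to keep it.
export TIKA_CONFIG="$config"

# Hypothetical input file; pass a real path as the first argument.
doc="${1:-example.doc}"

# Only call Tika when java, the jar and the document are all present.
if command -v java >/dev/null 2>&1 && [ -f tika-app-1.18.jar ] && [ -f "$doc" ]; then
  java -jar tika-app-1.18.jar -t "$doc"
else
  echo "Skipping Tika run: java, tika-app-1.18.jar or $doc not found" >&2
fi
```

If the warnings disappear with this minimal file, the parser exclusions in the full config are not what silences them, which matches the comment that the warnings are raised at initialisation time rather than at parse time.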