How to index a PDF / Word doc in Apache Solr

Question

I am new to the big data environment, so apologies in advance if this question is off base.

I want to read Word / PDF documents and index them in Solr. I understand that Solr accepts JSON or XML, not Word / PDF / txt files. Is it necessary to convert a Word / PDF document into JSON or XML before sending it to Solr? I initially thought I should use Tika, but my understanding is that Tika can convert a PDF to text, not to JSON.

Could you please explain how to index these documents in Solr?

Thanks for the help

Please read the documentation - the JSON / XML format is merely a description of the file you are submitting — user1859022
@user1859022 - Thank you. So far I can only index the metadata of the documents, not their actual content. Is there any way the actual content can be extracted? — Sijo K

MatsLindh · Accepted Answer · 2016-08-11T11:53:44

The standard endpoint for indexing 'rich files' is update/extract, so if you post your file to that destination, Solr will run it through Tika internally and extract the text and properties. You can provide literal values through the URL (such as an ID, filename, or other metadata) with literal.fieldname=value arguments.
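As a rough illustration of the request described above, here is a minimal Python sketch using only the standard library. The host, port, core name (`mycore`), field names and the helper names `build_extract_url` / `index_rich_file` are assumptions made for the example, not part of Solr; only the update/extract path and the literal.fieldname=value arguments come from the answer. `commit=true` is added so the document becomes searchable right away.

```python
import urllib.parse
import urllib.request


def build_extract_url(base_url, core, literals, commit=True):
    """Build an update/extract URL (hypothetical helper).

    Each entry in `literals` is sent as literal.<fieldname>=<value>.
    """
    params = {f"literal.{name}": value for name, value in literals.items()}
    if commit:
        params["commit"] = "true"  # make the document visible immediately
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/{core}/update/extract?{query}"


def index_rich_file(path, url, content_type="application/octet-stream"):
    """POST the raw file bytes (hypothetical helper).

    Solr passes the body to Tika on the server side, so no conversion to
    JSON or XML is done here. Requires a running Solr instance.
    """
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Assuming a local Solr with a core named `mycore`, a call might look like `index_rich_file("report.pdf", build_extract_url("http://localhost:8983/solr", "mycore", {"id": "doc1"}), "application/pdf")`.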

The "Uploading Data with Solr Cell using Apache Tika" section of the manual gives you a low-level introduction to submitting documents with curl over HTTP, as well as which configuration options are required to enable automagic extraction (which is enabled in a few of the examples; data driven and tech products, iirc):

If you are not working with the supplied sample_techproducts_configs or data_driven_schema_configs config set, you must configure your own solrconfig.xml to know about the JARs containing the ExtractingRequestHandler and its dependencies:

<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

You can then configure the ExtractingRequestHandler in solrconfig.xml.

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!--Optional.  Specify a path to a tika configuration file. See the Tika docs for details.-->
  <str name="tika.config">/my/path/to/tika.config</str>
  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
       for default date formats -->
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  <!-- Optional. Specify an external file containing parser-specific properties.
       This file is located in the same directory as solrconfig.xml by default.-->
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
