The standard endpoint for indexing 'rich files' is update/extract, so if you post your file to that destination, Solr will run it through Tika internally and extract the text and properties. You can provide literal values through the URL (such as an ID, filename, or other metadata) with literal.fieldname=value arguments.
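As a rough illustration of the literal.fieldname=value arguments, here is a minimal Python sketch that builds (but does not send) such a request using only the standard library. The core URL, the field names id and filename, and the helper build_extract_request are assumptions made up for this example; whether those fields exist depends on your schema. Sending the raw file as the request body mirrors what curl --data-binary does.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_core_url, file_bytes, content_type, literals, commit=True):
    """Build (but do not send) a POST request for Solr's update/extract endpoint.

    Hypothetical helper for illustration; not part of Solr or any client library.
    """
    # Each metadata value becomes a literal.<fieldname>=<value> URL argument.
    params = {f"literal.{name}": value for name, value in literals.items()}
    if commit:
        # Ask Solr to commit right away so the document becomes searchable.
        params["commit"] = "true"
    url = f"{solr_core_url.rstrip('/')}/update/extract?{urlencode(params)}"
    # The raw file is the request body; Solr hands it to Tika for extraction.
    return Request(url, data=file_bytes, headers={"Content-Type": content_type}, method="POST")


request = build_extract_request(
    "http://localhost:8983/solr/mycore",  # assumed core URL, adjust to your setup
    b"%PDF-1.4 ...",  # placeholder bytes; normally open(path, "rb").read()
    "application/pdf",
    {"id": "doc1", "filename": "report.pdf"},  # assumed field names
)
print(request.full_url)
```

To actually index the file, pass the request to urllib.request.urlopen(request) while Solr is running.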
The "Uploading Data with Solr Cell using Apache Tika" section of the manual gives you a low-level introduction to how to submit documents with curl through HTTP, as well as which configuration options are required to enable automagic extraction (which is enabled in a few of the examples (data driven and tech products, iirc)):
If you are not working with the supplied sample_techproducts_configs or data_driven_schema_configs config set, you must configure your own solrconfig.xml to know about the JARs containing the ExtractingRequestHandler and its dependencies:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
You can then configure the ExtractingRequestHandler in solrconfig.xml.
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional. Specify a path to a tika configuration file. See the Tika docs for details. -->
  <str name="tika.config">/my/path/to/tika.config</str>
  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
       for default date formats -->
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  <!-- Optional. Specify an external file containing parser-specific properties.
       This file is located in the same directory as solrconfig.xml by default. -->
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>