2
votes

I've successfully installed Solr 1.4.1, but I can't get Tika 0.4 (which is included in contrib/extraction) to work correctly. I'm getting a 404 error when attempting to hit http://localhost:8080/solr/ss/update/extract ("ss" is my core).

I've moved all of the contrib/extraction jars into the WEB-INF directory of Solr after it has been deployed, as well as the "solr-cell" jar that resides in the "dist" directory.

The method described above worked for Solr 3.3, but PDF parsing is broken in Tika 0.8, so I decided to revert to Solr 1.4.1 and Tika 0.4.

I'm using Tomcat 7.0, if that helps.


2 Answers

2
votes

I resolved the issue.

I had copied the multicore directories ("core0" and "core1" in example/multicore), and those use very stripped-down versions of solrconfig.xml that do not define the extract handler at all, which is why the URL returned a 404. I referred to the default example (located in example/solr), copied the "requestHandler" section for "update/extract" into my stripped-down solrconfig.xml, and restarted the Solr web app within Tomcat. File parsing now works perfectly.
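For reference, the section I mean looks roughly like the sketch below. Treat it as an approximation of the Solr 1.4 example config rather than an exact copy: the class name and the "text" field are assumptions on my part, so take the real block from the solrconfig.xml that ships with your own download.

```xml
<!-- Sketch of the extract handler for a Solr 1.4.x core.
     Assumption: the fully qualified class name below matches your release;
     check it against the example solrconfig.xml in your distribution. -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- Assumption: schema.xml defines a field called "text" for the body -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <!-- Fields not in the schema get this prefix instead of causing errors -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

The handler has to be defined in the solrconfig.xml of every core that should accept /update/extract requests, "ss" in my case.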

I hope this helps someone else.

2
votes

I ran into the same problem while customizing schema.xml for django_haystack with Solr 5.3.1, so I would like to add to Travis' answer.

The lines you need to add in solrconfig.xml are the following:

Below the Lucene version definition:

<luceneMatchVersion>5.3.1</luceneMatchVersion>

add these library imports (taken from the example files):

<lib dir="${solr.install.dir:../../../..}/contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

<lib dir="${solr.install.dir:../../../..}/contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-clustering-\d.*\.jar" />

<lib dir="${solr.install.dir:../../../..}/contrib/langid/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-langid-\d.*\.jar" />

<lib dir="${solr.install.dir:../../../..}/contrib/velocity/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-velocity-\d.*\.jar" />

Then add a requestHandler for /update/extract next to any requestHandler that is already defined:

<requestHandler name="/update/extract"
  startup="lazy"
  class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
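Since the defaults above send unknown fields to an "ignored_" prefix and anchor tags to "links", schema.xml needs fields that match. A minimal sketch, assuming the names used in the stock example schema (the "ignored" type and the "string" type for "links" are assumptions, so adjust them to whatever your schema already defines):

```xml
<!-- Goes inside <schema> in schema.xml. Names follow the stock example
     schema and are assumptions; rename to fit your own schema. -->

<!-- A type that neither indexes nor stores, so unmapped Tika metadata is dropped -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true" />

<!-- Catches every field renamed by uprefix=ignored_ and by fmap.div -->
<dynamicField name="ignored_*" type="ignored" multiValued="true" />

<!-- Receives the hrefs captured via captureAttr and fmap.a -->
<field name="links" type="string" indexed="true" stored="true" multiValued="true" />
```

Without a matching field for "ignored_*", documents with unexpected metadata are likely to be rejected with an unknown-field error even though the handler itself is reachable.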

I hope that helps.