Indexing office formats with a custom field type schema

Question

We have the following Solr (3.4) schema for indexing html/text documents:

 <fields>

   <field name="text" type="text" indexed="true"
          stored="true" required="false" multiValued="false"
          omitNorms="false"/>
   <field name="title" type="text" indexed="true"
          stored="true" required="false" multiValued="false"
          omitNorms="false"/>
   <field name="created" type="date" indexed="true"
          stored="true" required="true" multiValued="false"
          omitNorms="false"/>
   <field name="modified" type="date" indexed="true"
          stored="true" required="false" multiValued="false"
          omitNorms="false"/>
   <field name="filesize" type="integer" indexed="true"
          stored="true" required="false" multiValued="false"
          omitNorms="false"/>
   <field name="mimetype" type="string" indexed="true"
          stored="true" required="false" multiValued="false"
          omitNorms="false"/>
   <field name="id" type="string" indexed="true"
          stored="true" required="true" multiValued="false"
          omitNorms="false"/>
   <field name="tag" type="string" indexed="true"
          stored="true" required="false" multiValued="false"
          omitNorms="false"/>
   <field name="relpath" type="string" indexed="true"
          stored="true" required="false" multiValued="false"
          omitNorms="false"/>

   <dynamicField name="tika_*" type="ignored" />

 </fields>

The configuration is auto-generated from templates by the solrinstance recipe for zc.buildout.

Now we need to import/index PDF/Office files etc. into Solr for fulltext indexing.

The generated requestHandler for the extraction is:

  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="fmap.text">tika_content</str>
      <str name="lowernames">false</str>
      <str name="uprefix">tika_</str>
    </lst>
  </requestHandler>

But after uploading a PDF file through curl, I cannot find any indication that it has been indexed (no changes in the document stats, etc.).

What is the trick here?

[Update]

I am using

curl "http://localhost:8983/solr/update/extract?literal.id=2&commit=true&fmap.content=text" -F "[email protected]"

to upload a PDF file. Adding fmap.content=text seems to do the desired mapping (overriding the generated configuration).

This seems to have solved the problem.

Jayendra Jayendra · Accepted Answer · 2011-11-15T09:16:43

fmap is the field mapping applied to the content generated by Tika.

The Tika handler extracts the content of the uploaded document and assigns it to a field named content. <str name="fmap.content">text</str> maps that content field to the text field defined in the schema. Since you have a text field defined in the schema, this works.

However, for <str name="fmap.text">tika_content</str>, there is no tika_content field defined in the schema, nor, I think, does Tika generate a field named text, so this mapping would not result in any matches.
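
Applied to the generated handler from the question, the mapping described above could look roughly like the following. This is only a sketch: it assumes the rest of the handler stays as generated, and that the ignored field type referenced by the tika_* dynamic field is defined in the schema, so that any other Tika metadata prefixed via uprefix is dropped rather than rejected.

```xml
<!-- Sketch only: the handler from the question with the field mapping changed. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map Tika's extracted body ("content") onto the schema's "text" field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">false</str>
    <!-- Fields unknown to the schema get the tika_ prefix and so match
         the tika_* dynamic field (assumed to be of an ignored type). -->
    <str name="uprefix">tika_</str>
  </lst>
</requestHandler>
```

With a default like this in place, passing fmap.content=text on the curl URL should no longer be necessary. Since the configuration is generated by the buildout recipe, the change would have to go through whatever customization the recipe offers for this handler.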
