SOLR has a module called Cell. It uses Tika to extract content from documents and index it with SOLR.

From the sources at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction , I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by SOLR, but not stored. When you query for documents, "content" doesn't come up.

My SOLR instance has no custom schema (I left the default schema in place).

I'm trying to implement a similar kind of behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request goes:

<add commitWithin="60000">
    <doc>
        <field name="content">lorem ipsum</field>
        <field name="id">123456</field>
        <field name="someotherfield_i">17</field>
    </doc>
</add>

With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.

What am I missing about the way Cell adds documents?

2 Answers

2
votes

If you don't want your field to store the contents, you have to set the field as stored="false".

Since you're using the schemaless mode (there still is a schema, it's just generated dynamically when new fields are added), you'll have to use the Schema API to change the field.

You can do this by issuing a replace-field command:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field": {
    "name": "content",
    "type": "text",
    "stored": false
  }
}' http://localhost:8983/solr/collection/schema

You can see the defined fields by issuing a request against /collection/schema/fields.
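
For scripting the same change, here is a minimal Python sketch of the two requests above. It only assumes what the curl example already shows (host, port, the collection name `collection`, and the type `text`); the function names are made up for illustration, and `list_fields` assumes the Schema API returns the definitions under a `fields` key. Nothing is sent until you call `urlopen` yourself.

```python
import json
import urllib.request

# Assumed base URL, copied from the curl example; adjust host and collection name.
SCHEMA_URL = "http://localhost:8983/solr/collection/schema"


def build_replace_field_request(name, field_type, stored=False, schema_url=SCHEMA_URL):
    """Build (but do not send) a Schema API replace-field request."""
    payload = {"replace-field": {"name": name, "type": field_type, "stored": stored}}
    return urllib.request.Request(
        schema_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


def list_fields(schema_url=SCHEMA_URL):
    """Fetch the field definitions; needs a running Solr instance."""
    with urllib.request.urlopen(schema_url + "/fields") as resp:
        # Assumption: the response body is JSON with a top-level "fields" list.
        return json.load(resp)["fields"]


request = build_replace_field_request("content", "text")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To apply the change, pass `request` to `urllib.request.urlopen`, then call `list_fields()` and check the entry for `content`.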

1
votes

The Cell code indeed adds the content to the document as content, but there's a built-in field translation rule that replaces content with _text_. In schemaless SOLR, _text_ is defined as not stored.

The rule is invoked by the following line in SolrContentHandler.addField():

String name = findMappedName(fname);

The params object holds a parameter, fmap.content, that says content should be renamed to _text_. It comes from corename\conf\solrconfig.xml, which by default contains the following fragment:

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str> <!-- This one! -->
  </lst>
</requestHandler>
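
The effect of those fmap.* defaults can be illustrated with a small Python sketch. This is a simplification for illustration, not the Solr source: `DEFAULTS` is a plain dictionary standing in for the handler's params object, and `find_mapped_name` is a hypothetical counterpart of the Java `findMappedName` call quoted above.

```python
# Hypothetical stand-in for the <lst name="defaults"> block shown above.
DEFAULTS = {
    "lowernames": "true",
    "fmap.meta": "ignored_",
    "fmap.content": "_text_",
}


def find_mapped_name(field_name, params=DEFAULTS):
    """Return the fmap.<name> override if one is configured, else the name unchanged."""
    return params.get("fmap." + field_name, field_name)


# "content" has an fmap rule; "title" does not, so it passes through as-is.
print(find_mapped_name("content"))
print(find_mapped_name("title"))
```

A document posted straight to /update never goes through this renaming step, which is why its content field keeps its own name and its own (stored) definition.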

Meanwhile, in corename\conf\managed_schema there's a line:

<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>

And that's the whole story.