How does SOLR Cell add document content?

Question

SOLR has a module called Cell. It uses Tika to extract content from documents and index it with SOLR.

From the sources at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction, I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by SOLR, but not stored. When you query for documents, "content" doesn't come up.

I haven't defined a schema for my SOLR instance (I left the default schema in place).

I'm trying to implement a similar kind of behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request goes:

<add commitWithin="60000">
    <doc>
        <field name="content">lorem ipsum</field>
        <field name="id">123456</field>
        <field name="someotherfield_i">17</field>
    </doc>
</add>

With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.

What am I missing about the way Cell adds documents?

MatsLindh · Accepted Answer · 2016-10-31T16:31:00

If you don't want Solr to store the field's contents, you have to define the field with stored="false".

Since you're using the schemaless mode (there still is a schema, it's just generated dynamically when new fields are added), you'll have to use the Schema API to change the field.

You can do this by issuing a replace-field command:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field": {
    "name": "content",
    "type": "text",
    "stored": false
  }
}' http://localhost:8983/solr/collection/schema
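
Keep in mind that a schema change only applies to documents indexed after it; documents that are already in the index keep their stored value until you re-index them. Below is a rough sketch of how you could re-post a document and then check that the field is still searchable but no longer returned. The base URL, the core name "collection", the helper names and the file name doc.xml are all assumptions for illustration, so adjust them to your setup.

```shell
# Assumed base URL and core name; adjust to your installation.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/collection}"

# Build a select URL for a field:value query that asks for all stored fields.
select_url() {
  printf '%s/select?q=%s:%s&fl=*&wt=json\n' "$SOLR_CORE_URL" "$1" "$2"
}

# Re-post an XML <add> document (the file path is hypothetical) and commit.
reindex() {
  curl -s -X POST -H 'Content-Type: text/xml' \
       --data-binary @"$1" "$SOLR_CORE_URL/update?commit=true"
}

# Query on the indexed field; requires a running Solr instance.
check_not_stored() {
  curl -s "$(select_url "$1" "$2")"
}

# Usage (not executed here):
#   reindex doc.xml
#   check_not_stored content lorem
```

If the change took effect, the document should still match the query on content, but the content field should be absent from the returned document.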

You can see the defined fields by issuing a request against /collection/schema/fields.
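
To look at a single field instead of the whole list, you can append the field name to that path. A minimal sketch, assuming Solr runs on localhost:8983 and the core is named "collection" (the variable and function names are made up for this example):

```shell
# Assumed defaults; adjust to your installation.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-collection}"

# Print the Schema API URL for one field's definition.
field_url() {
  printf '%s/%s/schema/fields/%s\n' "$SOLR_URL" "$CORE" "$1"
}

# Fetch the definition of a single field; requires a running Solr instance.
show_field() {
  curl -s "$(field_url "$1")"
}

# Usage (not executed here):
#   show_field content
```

The response should list the field's properties, so you can confirm that stored is now false for content.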
