3 votes

Recently we started to explore Solr partial index updates.

The API for full and partial updates looks similar. Instead of

doc.addField("location", "UK");
solrClient.add(doc);

you have to write

doc.addField("location", map("set", "Germany"));
solrClient.add(doc);
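For reference, the same atomic update can be expressed as a JSON payload posted to Solr's /update handler; the modifier ("set") appears as a nested object on the field. The document id here is an assumption for illustration:

```json
[
  { "id": "mydoc", "location": { "set": "Germany" } }
]
```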

What I expected to happen: Solr would update the inverted index for the field "location".

What actually happens:

  • Solr loads the stored fields of the document
  • applies the given updates to the document
  • deletes the old document by id
  • writes the new document to the index
As a result, all non-stored fields are lost.

I found some old discussions on the mailing lists; people say this is expected behaviour and that you need to make all fields stored. We don't want to make all fields stored. The "stored" property was designed for fields that need to be returned in responses from Solr to the caller. We need only a small amount of meta-information in responses, so making all fields stored looks like overkill.

The question is: why does Solr/Lucene perform all these steps to execute a partial update? In my understanding, every field has its own inverted index located in its own file, so it should be possible to update fields independently. Judging by what really happens, Solr/Lucene is unable to update the index for a single field, and I can't find the reason for that.


1 Answer

4 votes

Your observation is correct - that is the behaviour. The reason is that a field's indexed value can depend on other fields (through the copyField directive, for example) and on how fields are merged (position increments, etc.). That is why a partial update is only possible with stored fields: the document is simply loaded, the value of the specific field is manipulated, and the document is indexed again.

Fields do not have their own files for their indexes - there is one set of files for the complete index, and the index is append-only: documents are not changed in place (a document is only marked as deleted, and the new version is appended to the index). When you run optimize on the index, it is rewritten without the deleted documents.

There is a way around this: if your field meets a set of conditions, an in-place update can be performed instead. This does what you're asking for.

In-place updates are very similar to atomic updates; in some sense, they are a subset of atomic updates. In regular atomic updates, the entire document is reindexed internally during the application of the update. With in-place updates, only the fields to be updated are affected, and the rest of the document is not reindexed internally. Hence, the efficiency of updating in-place is unaffected by the size of the documents being updated (i.e. number of fields, size of fields, etc.). Apart from these internal differences, there is no functional difference between atomic updates and in-place updates.

However, the requirements may not match your use case - the fields have to be non-indexed and numeric (since it is the docValues value in the background that's being replaced, not the content in the index, where this operation generally isn't possible - the index is append-only):

An atomic update operation is performed using this approach only when the fields to be updated meet these three conditions:

  • are non-indexed (indexed="false"), non-stored (stored="false"), single valued (multiValued="false") numeric docValues (docValues="true") fields;
  • the version field is also a non-indexed, non-stored single valued docValues field; and,
  • copy targets of updated fields, if any, are also non-indexed, non-stored single valued numeric docValues fields.
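As a concrete illustration, a schema satisfying these conditions could look like the fragment below (the field name "popularity" is an assumption; pfloat and plong are the numeric docValues-capable field types in recent Solr versions):

```xml
<!-- candidate for in-place updates: non-indexed, non-stored, single-valued numeric docValues -->
<field name="popularity" type="pfloat" indexed="false" stored="false"
       multiValued="false" docValues="true"/>

<!-- the version field must also be a non-indexed, non-stored docValues field -->
<field name="_version_" type="plong" indexed="false" stored="false"
       docValues="true"/>
```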

To use in-place updates, add a modifier to the field that needs to be updated; the value can be set to a new value or incremented.
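Sticking with the assumed "popularity" field from above, the two supported modifiers look like this as JSON update payloads (the document id is again an assumption):

```json
[
  { "id": "mydoc", "popularity": { "set": 42 } },
  { "id": "mydoc", "popularity": { "inc": 1 } }
]
```

If the field meets all three conditions, Solr applies these as in-place docValues updates instead of reindexing the whole document.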