
I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).

I'm able to remove all deleted documents by doing an optimize:

curl 'http://localhost:8983/solr/core_name/update?optimize=true'

But this takes hours to run and requires a lot of RAM and disk space.

Is there a better way to remove deleted documents from the Solr index, or to update a document without creating a deleted one?

Thanks for your help!

Update: running an update with commit=true and expungeDeletes=true every so often does remove some of the deleted documents that were added (from 8295 without it to 97 with it), but it also significantly increased execution time (from 2min30s to 5min). This is helpful, but I'd prefer not to add those deleted documents to begin with. – yoann

1 Answer


Lucene uses an append-only strategy: when a new version of an existing document is added, the old document is marked as deleted and a new one is inserted into the index. This lets Lucene avoid rewriting the whole index file as documents are added, at the cost of the old documents remaining physically present in the index until a merge or an optimize happens.

When you issue expungeDeletes, you're telling Solr to perform a merge on any segment whose number of deleted documents exceeds a certain threshold; in effect, you're forcing an optimize behind the scenes, limited to where Solr deems it necessary.
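For reference, that commit can be issued against the update handler the same way as the optimize call in the question (the URL is quoted here so the shell doesn't interpret the &):

curl 'http://localhost:8983/solr/core_name/update?commit=true&expungeDeletes=true'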

How you can work around this depends on the specifics of your use case; in the general case, leaving the merge factors etc. at their standard settings should be good enough. If you're not seeing any merges, you may have disabled automatic merging (depending on your index size, hundreds of thousands of deleted documents seems excessive for an indexing run that takes 2m30s). In that case, make sure merging is enabled properly and tweak its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) for the merge process. A sketch of that tuning follows below.
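As a rough sketch of what that tuning might look like in solrconfig.xml (the values here are illustrative, not recommendations, and deletesPctAllowed requires Solr 7.5 or later):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <!-- illustrative: cap the share of deleted documents kept in the index (7.5+) -->
    <double name="deletesPctAllowed">25.0</double>
  </mergePolicyFactory>
</indexConfig>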

If you're re-indexing your complete dataset each time, indexing into a separate collection/core and then switching an alias over (or renaming the core) when finished, before removing the old dataset, is also an option.
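A minimal sketch of that switch for a standalone setup, assuming a freshly built core named core_name_new (both core names are placeholders); SolrCloud setups would use the Collections API's CREATEALIAS action instead:

curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=core_name&other=core_name_new'

After the swap, queries against core_name hit the new index, and the old core (now named core_name_new) can be unloaded and deleted once you're satisfied with the result.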