Lucene keeps adding documents even though updateDocument is used

Question

My project is built on Lucene 6.6.0. It is a desktop search engine written in Java, where the search part is a separate application from the indexing part. From time to time I have to add new fields to the index to meet customer needs WITHOUT reindexing everything (i.e. parsing the files and indexing them again).

Consequently, when the application starts, I take the IndexWriter and open an IndexReader associated with it:

IndexReader reader = DirectoryReader.open(writer, true, false);

Then, for each document already present in the index:

StoredField fieldVersion = new StoredField(
            FIELDNAME_VERSION,
            fixedValue // The value is the same for all the documents but may change (for all documents) when upgrading the version.
            );

for (int i = 0; i < idMax; i++) {

    Document currentDoc = reader.document(i);
    // Checks if the field exists in the index
    if (
    // Field does not exist yet
    currentDoc.get(FIELDNAME_VERSION) == null || 
    // Field value is different from what it should be
    !currentDoc.get(FIELDNAME_VERSION).contentEquals(fixedValue))
      {
        // The field is missing or outdated: remove any old value from currentDoc,
        // then add the new one (also tried without removing first, with the same result)
        currentDoc.removeField(FIELDNAME_VERSION);
        currentDoc.add(fieldVersion);
        // Updates the document in the index
        writer.updateDocument(
            new Term(FIELDNAME_PATH, currentDoc.get(FIELDNAME_PATH)),
            currentDoc);

        // Also tried, instead of updateDocument:
        // writer.deleteDocuments(new Term(FIELDNAME_PATH, currentDoc.get(FIELDNAME_PATH)));
        // writer.addDocument(currentDoc);
      }
}
// When all documents have been checked, write the index
writer.commit();

When I first run this, the field is added to all documents that did not have it, as expected. The problem is that when fixedValue changes, a new document is added to the index. I expected currentDoc to have its fieldVersion updated, not another Document to be created with the same values as the original for every field except fieldVersion.

The IndexWriter is in append mode (also tried with append or create). If I first index a single file, I get 1 document in the index; after one index update I get 2 documents, then 4, then 8, then 16, ... all referring to the same single file (only fieldVersion has different content).

This other SO question did not help me.

Why is Lucene adding a new document when I ask it to update the existing one, and what should be done to work around this (i.e. replace the existing document with the same document, only with different content for fieldVersion)?

EDIT 1 :

It looks like a field is missing after calling this method. This field is initialized via:

 new TextField(FIELDNAME_UNSTORED,
            "",
            Field.Store.NO);

So it is not stored.

The field associated with FIELDNAME_PATH is initialized as

StringField pathField = new StringField(FIELDNAME_PATH,
            "",
            Field.Store.YES);

EDIT 2 :

What I actually don't get is this: if I only call deleteDocuments(new Term(...)), all documents are removed from the index (as expected), but if I call addDocument(currentDoc) after the delete, I get twice as many documents, as if the document were added once in its original version and a second time in its updated version.

SOLUTION :

As pointed out by @femtoRgon, the path field was not tokenized during the original indexing. But when the document read back from the index was added again, the field got automagically tokenized. So the solution is to recreate the path field (and the other ones) the same way as during indexing, store the fields in a temporary Document, and then call updateDocument() with this temporary Document.

Any help very much appreciated!

I'd rather create a new index with the additional fields in a separate folder and, when indexing is done, delete the previous index and rename the folder containing the new one. — Ivan
Nice workaround, thanks! While giving it a try I started to find it a bit excessive (the index weighs tens of GB on a LAN), because in other areas of the program I use updateDocument and it works. Would you mind elaborating on why you would favour creating a new index over updating an existing one? — HelloWorld

femtoRgon · Accepted Answer · 2018-07-08T08:51:21

Updates in Lucene always add a new document. An update just deletes any documents matching the given term first and, whether or not it finds a document to delete, it will happily add the new document. So, for whatever reason, you aren't getting a match on that term. You haven't shown how FIELDNAME_PATH is indexed, but for the pattern you have here, it should be indexed and not tokenized (i.e. use StringField).
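To make the doubling concrete, here is a toy model in plain Java. It is not Lucene code: the class `UpdateModel` and the sample path are made up for illustration, and the "index" is just a list of the terms under which each document's path was indexed. It only mimics the delete-matching-then-always-add behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of update-as-delete-then-add; plain Java, not Lucene's implementation. */
public class UpdateModel {

    /** Stand-in for an index: one entry per document, holding the term its path was indexed as. */
    private final List<String> indexedPathTerms = new ArrayList<>();

    /** Adds a document whose path field was indexed as the given term. */
    public void add(String indexedTerm) {
        indexedPathTerms.add(indexedTerm);
    }

    /** Deletes every document whose indexed term equals deleteTerm, then always adds the new one. */
    public void update(String deleteTerm, String newIndexedTerm) {
        indexedPathTerms.removeIf(t -> t.equals(deleteTerm));
        indexedPathTerms.add(newIndexedTerm);
    }

    public int size() {
        return indexedPathTerms.size();
    }

    /** Runs the given number of upgrade passes over an index that starts with one document. */
    public static int docsAfterPasses(int passes, boolean termMatches) {
        UpdateModel index = new UpdateModel();
        String path = "/docs/a.txt";                  // hypothetical sample path
        String indexed = termMatches ? path : "docs"; // "docs" mimics a tokenized path
        index.add(indexed);
        for (int p = 0; p < passes; p++) {
            int snapshot = index.size(); // the reader sees the docs present when the pass starts
            for (int i = 0; i < snapshot; i++) {
                index.update(path, indexed);
            }
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(docsAfterPasses(3, true));  // 1
        System.out.println(docsAfterPasses(3, false)); // 8
    }
}
```

When the delete term matches, the count stays at 1 no matter how many passes run. When it never matches, every pass adds one document per existing document, which gives the 2, 4, 8, ... growth from the question.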

You can test whether the term you are passing to the update is going to work simply by running a TermQuery. If you get 0 results from a TermQuery, then IndexWriter.updateDocument isn't going to find the document to delete either.
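A minimal sketch of that check, assuming Lucene 6.x on the classpath. The class and method names (`TermCheck`, `countMatches`) are made up for illustration; `IndexSearcher.count` and `TermQuery` are the Lucene calls doing the work.

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TermCheck {

    /** Counts the live documents that contain exactly this term in the given field. */
    public static int countMatches(IndexReader reader, String field, String value) throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        // A TermQuery is not analyzed: it matches only the exact indexed term,
        // which is the same lookup a delete-by-term performs.
        return searcher.count(new TermQuery(new Term(field, value)));
    }
}
```

Calling `countMatches(reader, FIELDNAME_PATH, currentDoc.get(FIELDNAME_PATH))` inside the loop from the question should return 1 for a healthy index. A result of 0 means the update will delete nothing and only add.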

As for the missing unstored fields: yes, the pattern you are using here isn't going to play well with them. Unstored fields are not included in the document returned from IndexReader.document (that's the point of storing a field, so that it will be retrievable from the index). So, since the field is missing from the result, it will still be missing from the document you pass into the update, unless you recreate that value in some other way. Either rebuild your document from whatever source materials you are using, or make sure anything you want persisted across updates is stored.
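One way to follow that advice is to build a fresh Document with explicit field types rather than re-adding the one read back from the index. The sketch below assumes Lucene 6.x. The class name, the constant values, and the `unstoredText` parameter are hypothetical (the question only shows the constant names), and the caller has to get the unstored text from the source file again.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class VersionUpgrader {
    // Hypothetical values; the question only shows the constant names.
    public static final String FIELDNAME_PATH = "path";
    public static final String FIELDNAME_VERSION = "version";
    public static final String FIELDNAME_UNSTORED = "content";

    /** Builds a fresh document instead of reusing the one read back from the index. */
    public static Document rebuild(Document stored, String version, String unstoredText) {
        Document fresh = new Document();
        // Same type as at indexing time: one exact, untokenized term.
        fresh.add(new StringField(FIELDNAME_PATH, stored.get(FIELDNAME_PATH), Field.Store.YES));
        fresh.add(new StoredField(FIELDNAME_VERSION, version));
        // Unstored text is not present in `stored`; it must be supplied again.
        fresh.add(new TextField(FIELDNAME_UNSTORED, unstoredText, Field.Store.NO));
        return fresh;
    }

    /** Replaces the document identified by its path with the rebuilt version. */
    public static void upgrade(IndexWriter writer, Document stored, String version, String unstoredText)
            throws IOException {
        Term key = new Term(FIELDNAME_PATH, stored.get(FIELDNAME_PATH));
        writer.updateDocument(key, rebuild(stored, version, unstoredText));
    }
}
```

Because the path is re-added as a StringField on every pass, the delete term keeps matching and the document count should stay constant across version upgrades.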
