My project revolves around Lucene 6.6.0. It is a desktop search engine written in Java, where the search part is a separate application from the indexing part. From time to time I have to add new fields to the index to meet customer needs WITHOUT reindexing everything (i.e. parsing the files and indexing them again).
Consequently, when the application starts, I take the IndexWriter and open an IndexReader associated with it:
IndexReader reader = DirectoryReader.open(writer, true, false);
Then, for each document already present in the index:
StoredField fieldVersion = new StoredField(
FIELDNAME_VERSION,
fixedValue // The value is the same for all the documents but may change (for all documents) when upgrading the version.
);
for (int i = 0; i < idMax; i++) {
Document currentDoc = reader.document(i);
// Checks if the field exists in the index
if (
// Field does not exist yet
currentDoc.get(FIELDNAME_VERSION) == null ||
// Field value is different from what it should be
!currentDoc.get(FIELDNAME_VERSION).contentEquals(fixedValue))
{
// The field is missing or outdated: remove any existing value from currentDoc, then add the new one (I also tried without removing it first, with the same result)
currentDoc.removeField(FIELDNAME_VERSION);
currentDoc.add(fieldVersion);
// Updates the document in the index
writer.updateDocument(
new Term(FIELDNAME_PATH, currentDoc.get(FIELDNAME_PATH)),
currentDoc);
// Alternative I also tried instead of updateDocument(), with the same result:
writer.deleteDocuments(new Term(FIELDNAME_PATH,
currentDoc.get(FIELDNAME_PATH)));
writer.addDocument(currentDoc);
}
}
// When all documents have been checked, write the index
writer.commit();
When I first run this, the field is added to all documents that did not have it, as expected. The problem is that when fixedValue changes, a new document is added to the index, whereas I expected currentDoc to have its fieldVersion updated, not another Document to be created with the same values as the original for every field except fieldVersion.
The IndexWriter is in append mode (I also tried create-or-append). If I first index a single file, I get 1 document in the index; after each following index update I get 2 documents, then 4, then 8, then 16, ... always referring to the same single file (only fieldVersion has different content).
This other SO question did not help me.
Why is Lucene adding a new document when I ask it to update the existing one, and what should be done to work around this (i.e. replace the existing document with the same document, just with different content for fieldVersion)?
EDIT 1 :
It looks like after calling this method a field is missing. This field is initialized via :
new TextField(FIELDNAME_UNSTORED,
"",
Field.Store.NO);
So it is not stored.
The field associated with FIELDNAME_PATH is initialized as
StringField pathField = new StringField(FIELDNAME_PATH,
"",
Field.Store.YES);
EDIT 2 :
What I don't actually get is that if I only call deleteDocuments(new Term(...)), then all documents are removed from the index (as expected), but if I call addDocument(currentDoc) after the delete, then I get twice as many documents, as if the document was added once in its original version and a second time in its updated version.
SOLUTION :
As pointed out by @femtoRgon, the path field was not tokenized during the original indexing, but the field read back through reader.document() got automagically tokenized when it was re-added, so the Term built from the full path no longer matched the document. The solution is to recreate the path field (and the other ones) exactly as during indexing, use a temporary Document to store the fields, and then call updateDocument() with this temporary Document.
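For anyone hitting the same problem, here is a minimal sketch of that approach against Lucene 6.6.0. The class and method names (VersionFieldUpdater, rebuild, upgrade) and the literal field names "path" and "version" are made up for this illustration, and it assumes the path and the version are the only stored fields; any other stored field would have to be recreated in rebuild() with the same field class that the indexer uses. Fields indexed with Field.Store.NO (such as FIELDNAME_UNSTORED in EDIT 1) are not returned by reader.document(), so they cannot be restored this way.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.Bits;

// Hypothetical helper, not part of Lucene.
final class VersionFieldUpdater {
    // Assumed field names; use the same constants as the indexing application.
    static final String FIELDNAME_PATH = "path";
    static final String FIELDNAME_VERSION = "version";

    // Copies the stored values into a temporary Document, recreating each
    // field with the same field class that was used at indexing time.
    static Document rebuild(Document stored, String version) {
        Document fresh = new Document();
        // StringField indexes the whole path as a single, untokenized term,
        // so a Term built from the full path keeps matching this document.
        fresh.add(new StringField(FIELDNAME_PATH, stored.get(FIELDNAME_PATH), Field.Store.YES));
        fresh.add(new StoredField(FIELDNAME_VERSION, version));
        return fresh;
    }

    // Rewrites every live document whose version differs from the given one.
    // Returns the number of documents that were rewritten.
    static int upgrade(IndexWriter writer, String version) throws IOException {
        int updated = 0;
        // The reader is a point-in-time snapshot, so updates made through the
        // writer inside the loop do not change what the loop iterates over.
        try (IndexReader reader = DirectoryReader.open(writer, true, false)) {
            Bits liveDocs = MultiFields.getLiveDocs(reader); // null when nothing is deleted
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (liveDocs != null && !liveDocs.get(i)) {
                    continue; // skip deleted documents so they are not re-added
                }
                Document stored = reader.document(i);
                if (version.equals(stored.get(FIELDNAME_VERSION))) {
                    continue; // already up to date
                }
                Term pathTerm = new Term(FIELDNAME_PATH, stored.get(FIELDNAME_PATH));
                // Deletes every document matching pathTerm, then adds the rebuilt one.
                writer.updateDocument(pathTerm, rebuild(stored, version));
                updated++;
            }
        }
        writer.commit();
        return updated;
    }
}
```

At startup this would be called as VersionFieldUpdater.upgrade(writer, fixedValue). The key point is that the Document returned by reader.document(i) is never passed back to the writer; only the freshly built temporary Document is.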
Any help very much appreciated!