
Currently I am able to add a list of documents as well as individual documents to the Apache Lucene index. But I am facing a problem while updating a document in the index:

My approach is: as soon as files are uploaded, and before writing them to disk, I check whether the files already exist in the drive/folder and delete the corresponding documents from the index based on the file name.

Secondly, I add the uploaded file to the Lucene index.

But the problem I am encountering is that both the newly added document and the old document show up in the search results, with different contents.

For example: the file name is Sample_One.txt with the text:

This is the sample text for first time.

I then delete the above file from the index and add the new file content to the index.

The file contents are now updated with different text under the same file name:

This is the sample text with updated content.

When I search for text such as "sample", the results show the Sample_One.txt file twice, once with the old content and once with the new content.

I want to know what I am missing and how to correctly update/delete a document in the index.

The code snippets are:

// Deleting the document from the index
public void deleteDocumentsFromIndexUsingTerm(Document doc) throws IOException, ParseException {
    Term fileTerm = new Term("file_name",doc.get("file_name"));
    Term contentTerm = new Term("content", doc.get("content"));
    Term docIDTerm = new Term("document_id", doc.get("document_id"));

    File indexDir = new File(INDEX_DIRECTORY);

    Directory directory = FSDirectory.open(indexDir.toPath());

    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig conf = new IndexWriterConfig(analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, conf);

    System.out.println("Deleting the term with - "+doc.get("file_name"));
    System.out.println("Deleting the term with contents - "+doc.get("content"));

    indexWriter.deleteDocuments(fileTerm);
    indexWriter.deleteDocuments(contentTerm);
    indexWriter.deleteDocuments(docIDTerm);
    indexWriter.commit();
    indexWriter.close();
}

// Snippet to add a document to the index

final String INDEX_DIRECTORY = "D:\\Development\\Lucene_Indexer";
    long startTime = System.currentTimeMillis();
    List<ContentHandler> contentHandlerList = new ArrayList<ContentHandler>();

    String fileNames = (String)request.getAttribute("message");

    File file = new File("D:\\Development\\Resume_Sample\\"+fileNames);

    ArrayList<File> fileList = new ArrayList<File>();
    fileList.add(file);

    Metadata metadata = new Metadata();

    // Pass -1 to BodyContentHandler to remove the default write limit and avoid the text-limit exception
    ContentHandler handler = new BodyContentHandler(-1);
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    InputStream stream = new FileInputStream(file);

    try {
        parser.parse(stream, handler, metadata, context);
        contentHandlerList.add(handler);
    }catch (TikaException e) {
        e.printStackTrace();
    }catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    FieldType fieldType = new FieldType();
    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    fieldType.setStoreTermVectors(true);
    fieldType.setStoreTermVectorPositions(true);
    fieldType.setStoreTermVectorPayloads(true);
    fieldType.setStoreTermVectorOffsets(true);
    fieldType.setStored(true);


    Analyzer analyzer = new StandardAnalyzer();
    Directory directory = FSDirectory.open(new File(INDEX_DIRECTORY).toPath());
    IndexWriterConfig conf = new IndexWriterConfig(analyzer);
    IndexWriter writer = new IndexWriter(directory, conf);

    Iterator<ContentHandler> handlerIterator = contentHandlerList.iterator();
    Iterator<File> fileIterator = fileList.iterator();

    while (handlerIterator.hasNext() && fileIterator.hasNext()) {
        Document doc = new Document();

        String text = handlerIterator.next().toString();
        String textFileName = fileIterator.next().getName();

        String idOne = UUID.randomUUID().toString();

        Field idField = new Field("document_id", idOne, fieldType);
        Field fileNameField = new Field("file_name", textFileName, fieldType);
        Field contentField = new Field("content", text, fieldType);

        doc.add(idField);
        doc.add(contentField);
        doc.add(fileNameField);

        writer.addDocument(doc);
    }

    writer.commit();
    writer.deleteUnusedFiles();
    long endTime = System.currentTimeMillis();

    writer.close();
    // Close the analyzer after the writer; closing it inside the loop would break further addDocument calls.
    analyzer.close();

As shown above, I first delete the document as soon as the files are uploaded, and then I index the updated document.


1 Answer


The problem is your fields are being analyzed when indexed, but the terms you are trying to delete with are not analyzed.
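
A quick way to see the mismatch is to print the tokens that actually get indexed for the file name. A minimal sketch, assuming the same StandardAnalyzer that is used at index time:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer analyzer = new StandardAnalyzer();
try (TokenStream ts = analyzer.tokenStream("file_name", "Sample_One.txt")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        // Prints the lowercased token(s) produced by the analyzer, which do not
        // equal the literal string "Sample_One.txt" used to build the delete Term.
        System.out.println(term.toString());
    }
    ts.end();
}
analyzer.close();

Because none of the Terms built in deleteDocumentsFromIndexUsingTerm match those analyzed tokens exactly, the delete calls remove nothing and the old document remains in the index.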

The best solution would be to make whichever field you want to use as an identifier for this purpose a StringField, which will cause it to be indexed without analysis. Such as:

Field idField = new StringField("document_id", idOne);
doc.add(idField);
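
With an un-analyzed identifier field like that, a single exact term is enough to remove or replace the old version. A minimal sketch, assuming the file name is also indexed as a StringField and writer is the same IndexWriter used for indexing:

// Option 1: delete the stale document by its exact file name, then add the new one.
writer.deleteDocuments(new Term("file_name", textFileName));
writer.addDocument(doc);

// Option 2: do the delete-and-add in one atomic call instead.
writer.updateDocument(new Term("file_name", textFileName), doc);

writer.commit();

updateDocument is usually the simpler choice, since the delete and the add are applied atomically as seen by index readers.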

Alternatively, you could use IndexWriter.deleteDocuments(Query...), and pass in an analyzed query (generated by the QueryParser), though in that case you should be careful not to delete more documents than you intended to (any documents found by the query will be deleted, not just the best result).
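
A minimal sketch of that approach, assuming the classic QueryParser and the same analyzer that was used at index time (every document matching the parsed query is deleted):

import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Parse the file name with the index-time analyzer so the query terms line up
// with the analyzed tokens, then delete everything that matches.
QueryParser queryParser = new QueryParser("file_name", analyzer);
Query query = queryParser.parse("\"Sample_One.txt\"");  // quoted so it is parsed as a single phrase
writer.deleteDocuments(query);
writer.commit();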