3
votes

I have a folder (MY_FILES) that holds around 500 files, and each day a new file arrives and is placed there. Each file is around 4 MB.

I've just developed a simple 'void main' to test if I can search for a specific wildcard in those files. It works just fine.

The problem is that I'm deleting the old indexed_folder and reindexing everything each time. This takes a lot of time and is obviously inefficient. What I'm looking for is 'incremental indexing': if the index already exists, just add the new files to it.

I was wondering if Lucene has some kind of mechanism to check if the 'doc' was indexed before trying to index it. Something like writer.isDocExists?

Thanks!

My code looks like this:

       // build the writer
       IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
       IndexWriter writer = new IndexWriter(fsDir, config);
       writer.deleteAll();  // wipes the index on every run; without it, searches return duplicate results
       // build the docs and add them to the writer
       File dir = new File(MY_FILES);
       File[] files = dir.listFiles();
       int counter = 0;
       for (File file : files)
       {
           String path = file.getCanonicalPath();
           FileReader reader = new FileReader(file);
           Document doc = new Document();
           doc.add(new Field("filename", file.getName(), Field.Store.YES, Field.Index.ANALYZED));
           doc.add(new Field("path", path, Field.Store.YES, Field.Index.ANALYZED));
           doc.add(new Field("content", reader));

           writer.addDocument(doc);
           System.out.println("indexing " + file.getName() + " " + (++counter) + "/" + files.length);
       }
       writer.close();  // commit the changes and release the index lock

2 Answers

5
votes

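First, use IndexWriter.updateDocument(Term, Document) instead of IndexWriter.addDocument. It deletes any documents that contain the given term before adding the new one, which prevents duplicate entries in your index. The term has to identify the file uniquely, so index the path with Field.Index.NOT_ANALYZED and pass new Term("path", path).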
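To make the difference concrete, here is a small plain-Java model of the two behaviours. It is not Lucene code: ToyIndex and its methods are hypothetical names, with add standing in for addDocument and update standing in for updateDocument (delete by term, then add).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for an index (hypothetical, not part of Lucene). */
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<Map<String, String>>();

    /** Models addDocument: always appends, so re-indexing a file duplicates it. */
    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Models updateDocument: remove every doc whose field equals value, then add. */
    void update(String field, String value, Map<String, String> doc) {
        Iterator<Map<String, String>> it = docs.iterator();
        while (it.hasNext()) {
            if (value.equals(it.next().get(field))) {
                it.remove();
            }
        }
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    /** Convenience factory for a document with a path and some content. */
    static Map<String, String> doc(String path, String content) {
        Map<String, String> d = new HashMap<String, String>();
        d.put("path", path);
        d.put("content", content);
        return d;
    }
}
```

Indexing the same path twice with add leaves two entries, while doing it with update leaves one, which is why deleteAll() is no longer needed once the writer uses updateDocument.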

To perform incremental indexing, add the last-modified timestamp to each document in your index, and only index files that are newer.

EDIT: more details on incremental indexing

Your documents should have at least two fields:

  • the path of the file
  • the timestamp of the file's last modification.

Before you start indexing, search your index for the latest timestamp, then crawl your directory to find all files whose timestamp is newer than that.

This way, your index will be updated every time a file changes.
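The directory crawl can be sketched in plain Java as below. This is only a sketch under assumptions: NewFileSelector and filesNewerThan are hypothetical names, and lastIndexedMillis is assumed to be the newest timestamp you read back from the index beforehand.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: picks the files that changed since the last indexing run. */
class NewFileSelector {

    /**
     * Returns the regular files in dir whose last-modified time is strictly
     * newer than lastIndexedMillis (the newest timestamp stored in the index).
     */
    static List<File> filesNewerThan(File dir, long lastIndexedMillis) {
        List<File> result = new ArrayList<File>();
        File[] files = dir.listFiles();
        if (files == null) {
            return result; // dir does not exist or is not a directory
        }
        for (File file : files) {
            if (file.isFile() && file.lastModified() > lastIndexedMillis) {
                result.add(file);
            }
        }
        return result;
    }
}
```

Each file returned by filesNewerThan(new File(MY_FILES), lastIndexedMillis) would then be indexed as in the question's loop. Note that the strict comparison skips a file whose timestamp equals the newest indexed one, so a file copied in with an old modification date would be missed.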

2
votes

If you want to check whether a document is already in the index, one approach is to build the corresponding Lucene query and run it against the index with an IndexSearcher.

For instance, here, you can build a query using the fields filename, path and content to check whether the document is already present in the index.

You will need an IndexSearcher in addition to your IndexWriter, and the full-text query you give to Lucene must follow the Lucene query syntax, for example:

 filename:myfile path:mypath content:mycontent

IndexSearcher indexSearcher = new IndexSearcher(directory);

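String queryString = // generate your query

// search() expects a Query object, so parse the query string first
Query query = new QueryParser(Version.LUCENE_36, "content", analyzer).parse(queryString);

indexSearcher.search(query, collector);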

In the code above, collector provides a callback method, collect, which is called with the document id of each document in the index that matches the query.