incremental indexing lucene

Question

I'm making an application in Java using Lucene 3.6 and want to make an incremental rate. I have already created the index, and I read that you have to do is open the existing index, and check each document indexing and document modification dates to see if they differ delete the index file and re-add again. My problem is I do not know how to do that in Java Lucene.

Thanks

My code is:

public static void main(String[] args) 
    throws CorruptIndexException, LockObtainFailedException,
           IOException {

    File docDir = new File("D:\\PRUEBASLUCENE");
    File indexDir = new File("C:\\PRUEBA");

    Directory fsDir = FSDirectory.open(indexDir);
    Analyzer an = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriter indexWriter
        = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);


    long numChars = 0L;
    for (File f : docDir.listFiles()) {
        String fileName = f.getName();
        Document d = new Document();
        d.add(new Field("Name",fileName,
                        Store.YES,Index.NOT_ANALYZED));
        d.add(new Field("Path",f.getPath(),Store.YES,Index.ANALYZED));
        long tamano = f.length();
        d.add(new Field("Size",""+tamano,Store.YES,Index.ANALYZED));
        long fechalong = f.lastModified();
        d.add(new Field("Modification_Date",""+fechalong,Store.YES,Index.ANALYZED));
        indexWriter.addDocument(d);
    }

    indexWriter.optimize();
    indexWriter.close();
    int numDocs = indexWriter.numDocs();

    System.out.println("Index Directory=" + indexDir.getCanonicalPath());
    System.out.println("Doc Directory=" + docDir.getCanonicalPath());
    System.out.println("num docs=" + numDocs);
    System.out.println("num chars=" + numChars);

}

Thanks Edmondo1984, you are helping me a lot.

Finally I did the code as shown below. Storing the hash of the file, and then checking the modification date.

In 9300 index files takes 15 seconds, and re-index (without any index has not changed because no file) takes 15 seconds. Am I doing something wrong or I can optimize the code to take less?

Thanks jtahlborn, doing what I managed to equalize indexReader times to create and update. Are not you supposed to update an existing index should be faster to recreate it? Is it possible to further optimize the code?

if(IndexReader.indexExists(dir))
            {
                //reader is a IndexReader and is passed as parameter to the function
                //searcher is a IndexSearcher and is passed as parameter to the function
                term = new Term("Hash",String.valueOf(file.hashCode()));
                Query termQuery = new TermQuery(term);
                TopDocs topDocs = searcher.search(termQuery,1);
                if(topDocs.totalHits==1)
                {
                    Document doc;
                    int docId,comparedate;
                    docId=topDocs.scoreDocs[0].doc;
                    doc=reader.document(docId);
                    String dateIndString=doc.get("Modification_date");
                    long dateIndLong=Long.parseLong(dateIndString);
                    Date date_ind=new Date(dateIndLong);
                    String dateFichString=DateTools.timeToString(file.lastModified(), DateTools.Resolution.MINUTE);
                    long dateFichLong=Long.parseLong(dateFichString);
                    Date date_fich=new Date(dateFichLong);
                    //Compare the two dates
                    comparedates=date_fich.compareTo(date_ind);
                    if(comparedate>=0)
                    {
                        if(comparedate==0)
                        {
                            //If comparation is 0 do nothing
                            flag=2;
                        }
                        else
                        {
                            //if comparation>0 updateDocument
                            flag=1;
                        }
                    }

Can you code in Java? What don't you understand? Be more specific... — Edmondo1984
the code is some snippet from which you won't get what you want. You'd better learn how lucene works and write it from scratch — Edmondo1984
I know this code is only going to index a directory, but want to learn to do it in a small example and then be able to do in my example. And the idea of how it works lucene think I have it, but not how to do this in particular. — Jose Luis Vázquez López
creating IndexReader and IndexSearcher is expensive. re-use them for each call and your code will be much faster. — jtahlborn

Edmondo1984 Edmondo1984 · Accepted Answer · 2012-07-17T06:48:00

According to Lucene data model, you store documents inside the index. Inside each document you will have the fields that you want to index, which are so called "analyzed" and the fields which are not "analyzed", where you can store a timestamp and other information you might need later.

I have the feeling you have a certain confusion between files and documents, because in your first post you speak about documents and now you are trying to call IndexFileNames.isDocStoreFile(file.getName()) which actually tells only if file is a file containing a Lucene index.

If you understand Lucene object model, writing the code you need takes approximately three minutes:

You have to check if the document is already existing in the index (for example by storing a non-analyzed field containing a unique identifier), by simply querying Lucene.
If your query returns 0 documents, you will add the new document to the index
If your query returns 1 document, you will get its "timestamp" field and compare it to the one of the new document you are trying to store. Then you can use the docId of the document to delete it from the index, if necessary, to add the new one.

If on the other side you are sure that you want always to modify the previous value, you can refer to this snippet from Lucene in Action:

public void testUpdate() throws IOException { 
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    doc.add(new Field("id", "1",
    Field.Store.YES,
    Field.Index.NOT_ANALYZED));
    doc.add(new Field("country", "Netherlands",
    Field.Store.YES,
    Field.Index.NO));
    doc.add(new Field("contents",
    "Den Haag has a lot of museums",
    Field.Store.NO,
    Field.Index.ANALYZED));
    doc.add(new Field("city", "Den Haag",
    Field.Store.YES,
    Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "1"),
    doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}

As you see, the snippets uses a non analyzed ID as I was suggesting to save a queryable - simple attribute, and method updateDocument to first delete and then re-add the doc.

You might want to directly check the javadoc at

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,org.apache.lucene.document.Document)

incremental indexing lucene

1 Answers