8
votes

We have a program which runs continually, does various things, and changes some records in our database. Those records are indexed using Lucene, so each time we change an entity we do something like:

  1. open db transaction, open Lucene IndexWriter
  2. make the changes to the db in the transaction, and update that entity in Lucene by using indexWriter.deleteDocuments(..) then indexWriter.addDocument(..).
  3. If all went well, commit the db transaction and commit the IndexWriter.

This works fine, but over time indexWriter.commit() takes more and more time. Initially it takes about 0.5 seconds, but after a few hundred such transactions it takes more than 3 seconds, and I don't doubt it would keep growing if the script ran longer.
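For reference, the per-entity flow described above looks roughly like the following self-contained sketch (class, method and field names are illustrative, not from our actual code; Lucene 5.x API, with the database steps reduced to comments):

```java
import java.nio.file.Files;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class PerTransactionCommit {

    // Replaces one entity's document and commits, mirroring steps 1-3 above.
    // The database work is elided as comments; field names are illustrative.
    static void updateEntity(IndexWriter writer, String id, String text) throws Exception {
        // ... make the changes to the db in the open transaction here ...
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("text", text, Field.Store.YES));
        writer.deleteDocuments(new Term("id", id)); // replace the entity's document
        writer.addDocument(doc);
        // ... commit the db transaction here ...
        writer.commit(); // the call whose cost grows with index size
    }

    static int run() throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Files.createTempDirectory("lucene-example")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            updateEntity(writer, "1", "hello");
            updateEntity(writer, "1", "world"); // same id: delete + re-add leaves one doc
            return writer.numDocs();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints 1
    }
}
```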

My solution so far has been to comment out the indexWriter.addDocument(..) and indexWriter.commit() calls, and to recreate the entire index every now and again by first using indexWriter.deleteAll() and then re-adding all documents, within one Lucene transaction/IndexWriter (about 250k documents in about 14 seconds). But this obviously goes against the transactional approach offered by databases and Lucene, which keeps the two in sync and keeps database updates visible to users of our tools who search via Lucene.
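The periodic full rebuild amounts to something like this sketch (entity loading is simulated with generated documents; the class and field names are illustrative, not our actual code):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.RAMDirectory;

public class FullRebuild {

    // Rebuilds the whole index as one Lucene "transaction": deleteAll(),
    // bulk re-add, then a single commit at the end. Entities are simulated.
    static int rebuild(IndexWriter writer, int count) throws IOException {
        writer.deleteAll();
        for (int i = 1; i <= count; i++) {
            Document doc = new Document();
            doc.add(new StringField("id", String.valueOf(i), Field.Store.YES));
            doc.add(new TextField("text", "whatever " + i, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.commit(); // one commit for the whole batch, not one per document
        return writer.numDocs();
    }

    public static void main(String[] args) throws IOException {
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            System.out.println(rebuild(writer, 1000)); // prints 1000
        }
    }
}
```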

It seems strange that I can add 250k documents in 14 seconds, but adding 1 document takes 3 seconds. What am I doing wrong, how can I improve the situation?

Can you just fix it with background tasks? You'll probably have a 10-second penalty, but that can be OK for many applications. — AdamSkywalker
@AdamSkywalker But it gets slower and slower; what about when it takes 1 hour, or 10 hours, or 2 days? — Adrian Smith

2 Answers

16
votes

What you are doing wrong is assuming that Lucene's built-in transactional capabilities have performance and guarantees comparable to a typical relational database, when they really don't. More specifically in your case, a commit syncs all index files with the disk, making commit times proportional to index size. That is why over time your indexWriter.commit() takes more and more time. The Javadoc for IndexWriter.commit() even warns that:

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

Can you imagine database documentation telling you to avoid doing commits?

Since your main goal seems to be to keep database updates visible through Lucene searches in a timely manner, do the following to improve the situation:

  1. Have indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) trigger after a successful database commit, instead of before
  2. Perform indexWriter.commit() periodically instead of every transaction, just to make sure your changes are eventually written to disk
  3. Use a SearcherManager for searching and invoke maybeRefresh() periodically to see updated documents within a reasonable time frame

The following is an example program which demonstrates how document updates can be retrieved by periodically performing maybeRefresh(). It builds an index of 100000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update a single document, then repeatedly searches until the update is visible. All resources are properly cleaned up on program termination. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is invoked, not commit().

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LucenePeriodicCommitRefreshExample {
    ScheduledExecutorService scheduledExecutor;
    MyIndexer indexer;
    MySearcher searcher;

    void init() throws IOException {
        scheduledExecutor = Executors.newScheduledThreadPool(3);
        indexer = new MyIndexer();
        indexer.init();
        searcher = new MySearcher(indexer.indexWriter);
        searcher.init();
    }

    void destroy() throws IOException {
        searcher.destroy();
        indexer.destroy();
        scheduledExecutor.shutdown();
    }

    class MyIndexer {
        IndexWriter indexWriter;
        Future<?> commitFuture;

        void init() throws IOException {
            indexWriter = new IndexWriter(FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")), new IndexWriterConfig(new StandardAnalyzer()));
            indexWriter.deleteAll();
            for (int i = 1; i <= 100000; i++) {
                add(String.valueOf(i), "whatever " + i);
            }
            indexWriter.commit();
            commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    indexWriter.commit();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void add(String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }

        void update(String id, String text) throws IOException {
            indexWriter.deleteDocuments(new Term("id", id));
            add(id, text);
        }

        void destroy() throws IOException {
            commitFuture.cancel(false);
            indexWriter.close();
        }
    }

    class MySearcher {
        IndexWriter indexWriter;
        SearcherManager searcherManager;
        Future<?> maybeRefreshFuture;

        public MySearcher(IndexWriter indexWriter) {
            this.indexWriter = indexWriter;
        }

        void init() throws IOException {
            searcherManager = new SearcherManager(indexWriter, true, null);
            maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    searcherManager.maybeRefresh();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        String findText(String id) throws IOException {
            IndexSearcher searcher = null;
            try {
                searcher = searcherManager.acquire();
                TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
                return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
            } finally {
                if (searcher != null) {
                    searcherManager.release(searcher);
                }
            }
        }

        void destroy() throws IOException {
            maybeRefreshFuture.cancel(false);
            searcherManager.close();
        }
    }

    public static void main(String[] args) throws IOException {
        LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
        example.init();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    example.destroy();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });

        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("Enter a document id to update (from 1 to 100000): ");
            String id = scanner.nextLine();
            System.out.print("Enter what you want the document text to be: ");
            String text = scanner.nextLine();
            example.indexer.update(id, text);
            long startTime = System.nanoTime();
            String foundText;
            do {
                foundText = example.searcher.findText(id);
            } while (!text.equals(foundText));
            long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
            System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}

This example was successfully tested using Lucene 5.3.1 and JDK 1.8.0_66.

3
votes

My first approach: do not commit that often. When you delete and re-add a document you will probably trigger a merge, and merges are somewhat slow.

If you use a near-real-time IndexReader you can still search like you used to (it does not show deleted documents), but you do not get the commit penalty. You can always commit later to make sure the file system is in sync with your index, and you can do this while using the index, so you do not have to block all other operations.
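A minimal sketch of that near-real-time approach (the class name, RAMDirectory, and field names are my own illustration; DirectoryReader.open(IndexWriter, boolean) is the Lucene 5.x NRT entry point):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class NrtReaderSketch {

    // Adds a document and searches for it via a near-real-time reader,
    // without ever calling writer.commit(). Returns the hit count.
    static int searchUncommitted() throws Exception {
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(doc);

            // The NRT reader sees the uncommitted document immediately.
            try (DirectoryReader reader = DirectoryReader.open(writer, true)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("id", "1")), 1);
                return hits.totalHits;
            }
            // After further updates, DirectoryReader.openIfChanged(reader, writer, true)
            // reopens cheaply; commit() can happen later, on its own schedule.
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(searchUncommitted()); // prints 1
    }
}
```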

See also this interesting blog post (and do read the other posts as well; they provide great information).