
I've written a program that uses Lucene.Net to index a 3 GB text file. While the index is being built, the process's CPU consumption reaches as high as 80% and its memory usage climbs to around 1 GB. Is there a way to limit the CPU and memory usage? Below is the code I'm using to build the index:

public void BuildIndex(string item)
{
    System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Indexing Started for " + item);
    string indexPath = string.Format(BaseIndexPath, "20200414", item);
    if (System.IO.Directory.Exists(indexPath))
    {
        System.IO.Directory.Delete(indexPath, true);
    }

    LuceneIndexDirectory = FSDirectory.Open(indexPath);
    Writer = new IndexWriter(LuceneIndexDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    Writer.SetRAMBufferSizeMB(500);

    string file = @"c:\LogFile.txt";   // verbatim string so the backslash is not treated as an escape
    string line = string.Empty;
    int count = 0;
    using (StreamReader fileReader = new StreamReader(file))
    {
        while ((line = fileReader.ReadLine()) != null)
        {
            count++;
            Document doc = new Document();

            try
            {
                doc.Add(new Field("LineNumber", count.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("LogTime", line.Substring(6, 12), Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("LineText", line.Substring(18, line.Length - 18), Field.Store.YES, Field.Index.NOT_ANALYZED));
                Writer.AddDocument(doc);
            }
            catch (Exception)
            {
                System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Exception occurred while adding a line to the index");
            }
        }
    }

    System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Indexing finished for " + item + ". Starting Optimization now.");
    Writer.Optimize();
    Writer.Commit();
    Writer.Close();

    LuceneIndexDirectory.Dispose();

    System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Optimization finished for " + item);
}

1 Answer


Writing an index is generally done out of band with the search; that is, it is typically done during deployment or at application startup. It is also possible to have near real-time search, which involves keeping an open IndexWriter that is used both for writing to and searching the same index, but in that case a typical application adds a few documents at a time rather than building the entire index at once.
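For reference, here is a minimal sketch of that incremental pattern against the Lucene.Net 3.x API your code already uses. The class, method, and field names are my own illustration, not something from your program, and for simplicity it commits small batches and reopens a plain read-only reader rather than using the writer's true near-real-time reader:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public class LogIndex : System.IDisposable
{
    private readonly Directory _directory;
    private readonly IndexWriter _writer;

    public LogIndex(string indexPath)
    {
        _directory = FSDirectory.Open(indexPath);
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        _writer = new IndexWriter(_directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Add a small batch of lines and commit, instead of rebuilding the whole index in one pass.
    public void AddLines(System.Collections.Generic.IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            var doc = new Document();
            doc.Add(new Field("LineText", line, Field.Store.YES, Field.Index.ANALYZED));
            _writer.AddDocument(doc);
        }
        _writer.Commit(); // make the new documents visible to readers opened after this point
    }

    // Search what has been committed so far; the term must match an analyzed (lowercased) token.
    public int CountHits(string term)
    {
        using (var reader = IndexReader.Open(_directory, true))   // true = read-only
        using (var searcher = new IndexSearcher(reader))
        {
            return searcher.Search(new TermQuery(new Term("LineText", term)), 1).TotalHits;
        }
    }

    public void Dispose()
    {
        _writer.Dispose();
        _directory.Dispose();
    }
}

The point is simply that the writer stays open and work arrives in small increments; your one-shot 3 GB build is a heavier workload, which is fine as long as it runs at the right time.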

Generally speaking, if you are building the index at the right point in your application lifecycle, using so much RAM is not a big deal.

However, you are calling Optimize() with no arguments immediately after building the index. If the freshly written index occupies multiple segments, that call rewrites the entire index into a single segment, which is a lot of extra CPU and disk work on top of indexing 3 GB of data.

From the documentation (emphasis mine):

Requests an "optimize" operation on an index, priming the index for the fastest available search. Traditionally this has meant merging all segments into a single segment as is done in the default merge policy, but individaul merge policies may implement optimize in different ways.

It is recommended that this method be called upon completion of indexing. In environments with frequent updates, optimize is best done during low volume times, if at all.

See http://www.gossamer-threads.com/lists/lucene/java-dev/47895 for more discussion.

Note that optimize requires 2X the index size free space in your Directory (3X if you're using compound file format). For example, if your index size is 10 MB then you need 20 MB free for optimize to complete (30 MB if you're using compound file format).

If some but not all readers re-open while an optimize is underway, this will cause > 2X temporary space to be consumed as those new readers will then hold open the partially optimized segments at that time. It is best not to re-open readers while optimize is running.

Note that the Optimize() method was removed in Lucene 4.x (for good reason), so I would recommend that you stop using it now.
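In your case that means dropping the Optimize() call and letting the normal merge policy merge segments in the background as documents are added. A minimal sketch of how the tail of your BuildIndex method could look, using the same Writer and LuceneIndexDirectory fields as in your code:

// ... AddDocument loop as before ...

Writer.Commit();                  // flush buffered documents and make them searchable
// Writer.Optimize();             // removed: merging everything into one segment is what
                                  // rewrites the whole index and spikes CPU and disk I/O
Writer.Dispose();                 // Dispose() (or Close()) releases the writer and its lock
LuceneIndexDirectory.Dispose();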