I'm trying to use the latest version of Lucene.NET (added to my project via NuGet) in an Azure web role. The original web application (MVC4) was designed to run either in a traditional web host or in Azure: in the former case it uses a file-system-based Lucene directory, writing the index to an *App_Data* subdirectory; in the latter it uses the AzureDirectory from the Lucene.Net.Store.Azure NuGet package.
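For context, this is roughly how the directory is chosen (a simplified sketch: the helper name, the "StorageConnectionString" setting and the "lucene-index" container name are placeholders, and I am assuming the AzureDirectory constructor that takes a CloudStorageAccount plus a catalog/container name):

    // Simplified sketch; assumes Lucene.Net.Store, Lucene.Net.Store.Azure,
    // Microsoft.WindowsAzure.ServiceRuntime, System.Web.Hosting and the
    // Azure storage client namespaces are imported.
    private Lucene.Net.Store.Directory CreateLuceneDirectory()
    {
        if (RoleEnvironment.IsAvailable)
        {
            // Azure web role: the index lives in blob storage via AzureDirectory.
            // "StorageConnectionString" and "lucene-index" are placeholder names.
            var account = CloudStorageAccount.Parse(
                CloudConfigurationManager.GetSetting("StorageConnectionString"));
            return new AzureDirectory(account, "lucene-index");
        }
        // Traditional web host: the index lives under App_Data.
        string path = HostingEnvironment.MapPath("~/App_Data/LuceneIndex");
        return FSDirectory.Open(new System.IO.DirectoryInfo(path));
    }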
The documents being indexed come either from the web or from files uploaded locally, since some of the collections to index are closed and rather small. To start with, I am working with one of these small closed sets, comprising about 1,000 files totaling a couple of GB.
When I index this set locally in my development environment, indexing completes and I can successfully search the index. When instead I try indexing on Azure, it fails to complete, and I have no clue about the exact problem: I added both Elmah and NLog to log any issue, but nothing gets registered in Elmah or in the monitoring tools configured from the Azure console. Only once did I get an error from NLog: an OutOfMemoryException thrown by the Lucene index writer at the end of the process, when committing the added documents. So I have tried:
- explicitly setting a very low RAM buffer size by calling SetRAMBufferSizeMB(10.0) on my writer;
- committing multiple times, e.g. every 200 documents added.
- removing any call to Optimize after the indexing completes (see also http://blog.trifork.com/2011/11/21/simon-says-optimize-is-bad-for-you/ on this).
- targeting either the file system or the Azure storage.
- scaling the web role VM up to the Large size.
Most of these attempts fail at different stages: sometimes indexing stops after 100-200 documents, other times it reaches 800-900; when I'm lucky, it even completes. That has happened only with the file system, though, and never with Azure storage: I have never managed to complete an indexing run against it.
The essential part of my Lucene code is very simple:
    IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.SetRAMBufferSizeMB(10.0);
where directory is an instance of FSDirectory or AzureDirectory, according to the test being executed. I then add documents with their fields, using UpdateDocument, since one of the fields represents a unique ID. Once finished, I call writer.Dispose(). If required by the test, I call writer.Commit() several times before the final Dispose; this usually lets the process get further before hitting the memory exception. Could anyone suggest how I might get this indexing to complete?
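For reference, this is roughly what my add/commit loop looks like (simplified: itemsToIndex, item.Id, item.Text and the field names "id" and "content" are placeholders, the 200-document interval is the one mentioned above, and the using block stands in for my explicit Dispose call):

    using (var writer = new IndexWriter(directory, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED))
    {
        writer.SetRAMBufferSizeMB(10.0);
        int count = 0;
        foreach (var item in itemsToIndex)
        {
            var doc = new Document();
            // the unique-ID field is stored and not analyzed so it can serve as a key
            doc.Add(new Field("id", item.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("content", item.Text, Field.Store.NO, Field.Index.ANALYZED));
            // UpdateDocument replaces any existing document with the same ID
            writer.UpdateDocument(new Term("id", item.Id), doc);

            if (++count % 200 == 0)
                writer.Commit(); // periodic commit, trying to limit memory usage
        }
        writer.Commit();
        // no Optimize(); Dispose() is called by the using block
    }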