I'm trying to use the latest version of Lucene.NET (added to my project via NuGet) in an Azure web role. The original web application (MVC4) was designed to run either in a traditional web host or in Azure: in the former case it uses a file-system-based Lucene directory, writing the index to an *App_Data* subdirectory; in the latter it uses the AzureDirectory from the Lucene.Net.Store.Azure NuGet package.
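For context, this is roughly how the directory is chosen (a simplified sketch: the helper name, the "StorageConnectionString" setting and the "lucene-index" container name are placeholders, and I am assuming the AzureDirectory constructor that takes a CloudStorageAccount plus a catalog/container name):

    // Simplified sketch; assumes Lucene.Net.Store, Lucene.Net.Store.Azure,
    // Microsoft.WindowsAzure.ServiceRuntime, System.Web.Hosting and the
    // Azure storage client namespaces are imported.
    private Lucene.Net.Store.Directory CreateLuceneDirectory()
    {
        if (RoleEnvironment.IsAvailable)
        {
            // Azure web role: the index lives in blob storage via AzureDirectory.
            // "StorageConnectionString" and "lucene-index" are placeholder names.
            var account = CloudStorageAccount.Parse(
                CloudConfigurationManager.GetSetting("StorageConnectionString"));
            return new AzureDirectory(account, "lucene-index");
        }
        // Traditional web host: the index lives under App_Data.
        string path = HostingEnvironment.MapPath("~/App_Data/LuceneIndex");
        return FSDirectory.Open(new System.IO.DirectoryInfo(path));
    }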
The documents being indexed come either from the web or from files uploaded locally, since some of the collections to index are closed and rather small. To start with, I am working with one of these small closed sets, comprising about 1,000 files totaling a couple of GB.
When I index this set locally in my development environment, indexing completes and I can successfully search the index. When instead I try indexing on Azure, it fails to complete, and I have no clue about the exact problem: I added both Elmah and NLog to log any issue, but nothing gets registered in Elmah or in the monitoring tools configured from the Azure console. Only once did I get an error from NLog: an OutOfMemoryException thrown by the Lucene index writer at the end of the process, when committing the added documents. So I have tried:
- explicitly setting a very low RAM buffer size by calling SetRAMBufferSizeMB(10.0) on my writer;
- committing multiple times, e.g. every 200 documents added.
- removing any call to Optimize after the indexing completes (see also http://blog.trifork.com/2011/11/21/simon-says-optimize-is-bad-for-you/ on this).
- targeting either the file system or the Azure storage.
- scaling the web role VM up to the Large size.
Most of these attempts fail at different stages: sometimes indexing stops after 100-200 documents, other times it reaches 800-900; when I'm lucky, it even completes. That has happened only with the file system, though, and never with Azure storage: I have never managed to complete an indexing run against it.
The essential part of my Lucene code is very simple:
    IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.SetRAMBufferSizeMB(10.0);
where directory is an instance of FSDirectory or AzureDirectory, according to the test being executed. I then add documents with their fields, using UpdateDocument, since one of the fields represents a unique ID. Once finished, I call writer.Dispose(). If required by the test, I call writer.Commit() several times before the final Dispose; this usually lets the process get further before hitting the memory exception. Could anyone suggest how I might get this indexing to complete?
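For reference, this is roughly what my add/commit loop looks like (simplified: itemsToIndex, item.Id, item.Text and the field names "id" and "content" are placeholders, the 200-document interval is the one mentioned above, and the using block stands in for my explicit Dispose call):

    using (var writer = new IndexWriter(directory, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED))
    {
        writer.SetRAMBufferSizeMB(10.0);
        int count = 0;
        foreach (var item in itemsToIndex)
        {
            var doc = new Document();
            // the unique-ID field is stored and not analyzed so it can serve as a key
            doc.Add(new Field("id", item.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("content", item.Text, Field.Store.NO, Field.Index.ANALYZED));
            // UpdateDocument replaces any existing document with the same ID
            writer.UpdateDocument(new Term("id", item.Id), doc);

            if (++count % 200 == 0)
                writer.Commit(); // periodic commit, trying to limit memory usage
        }
        writer.Commit();
        // no Optimize(); Dispose() is called by the using block
    }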