3 votes

I have a single XML file that I want to index using Lucene.NET. The file is basically a large collection of logs. Since the file is over 5 GB and I am developing on a system with 2 GB of RAM, how can I perform the indexing when I am not parsing the file, nor creating any fields other than "text", which will contain the file data?

I am using some code from CodeClimber, and at the moment I am not sure what the best approach would be for indexing such a large single file.

Is there a way to pass the file data to the index in chunks? Below are the lines of code that create the text field and add the associated file data:

Document doc = new Document();
doc.Add(new Field("Body", text, Field.Store.YES, Field.Index.TOKENIZED));
writer.AddDocument(doc);

Thank you for the guidance


2 Answers

3 votes

You should use something like System.Xml.XmlReader, which doesn't load the whole XML into memory. But indexing the whole XML as a single document doesn't make much sense, since every search would return either 1 or 0 documents (found or not found), so passing the data in chunks wouldn't help you much. Instead, split the XML into many documents (and fields) while reading it, so that searches return reasonable results.

how can I perform the indexing when I am not parsing the file, nor creating any fields other than "text", which will contain the file data

what a wonderful world it would be
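For illustration, here is a minimal sketch of that approach, assuming the log events are wrapped in <entry> elements (adjust the name to whatever your file uses) and the same Lucene.NET field API as in the question:

using System.Xml;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class LogIndexer
{
    public static void IndexLogFile(string path, IndexWriter writer)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            // Stream to each <entry> element; the 5GB file is never loaded whole.
            while (reader.ReadToFollowing("entry"))
            {
                // Read just this entry's content, then move past it.
                string text = reader.ReadInnerXml();

                Document doc = new Document();
                doc.Add(new Field("Body", text, Field.Store.YES, Field.Index.TOKENIZED));
                writer.AddDocument(doc); // one searchable document per log entry
            }
        }
    }
}

Each ReadInnerXml call pulls only one entry into memory, so the file is streamed rather than loaded, and every log event becomes a separately searchable document.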

0 votes

Indexing such large files is no problem. Just parse your XML file with a SAX-style parser (which is event-based and doesn't need to load the whole file into memory), buffer your input, and add a document to your IndexWriter at the end of every log event.
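.NET has no SAX parser in the base library, but XmlReader gives you the same forward-only streaming behaviour. Here is a rough sketch of the buffering idea, again assuming <entry> log events and the older Lucene.NET API from the question ("index" and "logs.xml" are placeholder paths):

using System.Text;
using System.Xml;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class LogEventIndexer
{
    public static void Run()
    {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        StringBuilder buffer = new StringBuilder();

        using (XmlReader reader = XmlReader.Create("logs.xml"))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "entry")
                {
                    buffer.Length = 0;           // start buffering a new log event
                }
                else if (reader.NodeType == XmlNodeType.Text ||
                         reader.NodeType == XmlNodeType.CDATA)
                {
                    buffer.Append(reader.Value); // accumulate this event's text
                }
                else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "entry")
                {
                    Document doc = new Document();
                    doc.Add(new Field("Body", buffer.ToString(),
                                      Field.Store.YES, Field.Index.TOKENIZED));
                    writer.AddDocument(doc);     // one document per log event
                }
            }
        }

        writer.Optimize();
        writer.Close();
    }
}

Memory usage stays bounded by the size of a single log event, not the size of the file.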