
Does anyone know why this is happening? I'm doing basic indexing plus SAX parsing of XML files and adding each path as a new field in the document. I got to about 1.5 million files, and it's been stuck on this one file for 30 minutes while the .nrm (norms file?) gets larger and larger.

I don't know why this is happening, my IndexWriter is of the form:

writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);

Is this not optimal to use for large indices? Why does it freeze on this one file? I've run it multiple times with over 1 million XML files, and it persistently gets stuck on different XML files (not just this one in particular, whose structure is fine).
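
For reference, this is the kind of tuning I know IndexWriter allows but haven't tried yet; the snippet below is only a sketch, and the values are placeholders rather than anything I'm actually running:

    // Sketch only (Lucene 3.0): writer settings that are supposed to matter for bulk indexing.
    // The concrete numbers are placeholders I would have to experiment with.
    IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(256);    // flush buffered documents after this much RAM
    writer.setMergeFactor(20);         // merge less often while bulk indexing
    writer.setUseCompoundFile(false);  // skip compound-file packing during the bulk load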

Edit:

So let's say I'm indexing files 2,000 at a time with separate java commands. After the indexing is complete and I call the IndexWriter's close method, am I missing anything if I want to write to this index again? Should I optimize the index? I think I recall Lucene in Action saying to optimize if you won't be writing to it for a while.
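
To make the question concrete, this is roughly how I picture the reopen-and-append step between batches (just a sketch; the directory path is a stand-in for my real one):

    // Sketch of the per-batch flow I have in mind (Lucene 3.0).
    // create=false reopens the existing index and appends instead of overwriting it.
    Directory dir = FSDirectory.open(new File("/path/to/index")); // placeholder path
    IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            false, // append to the existing index
            IndexWriter.MaxFieldLength.UNLIMITED);

    // ... add the next 2,000 documents here ...

    writer.optimize(); // optional: merge segments if I won't be writing for a while
    writer.close();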

Actually, this approach worked for 1.8 million files, but after I tried to add more in batches of 2,000, the .nrm file and another one grew to around 70GB! Why would the JVM run out of memory if the Lucene indexing code is only ever called on batches of 2,000 files? It doesn't seem like a garbage collection problem, unless there's something I need to add to the Lucene code before I close the IndexWriter.

Edit 2:

I have about 4 million XML files that look like :

<?xml version="1.0" encoding="UTF-8"?>
<person>
   <name>Candice Archie
   </name>
   <filmography>
      <direct>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </direct>
      <write>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </write>
      <edit>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </edit>
      <produce>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </produce>
   </filmography>
</person>

I parse these XML files and add each element's contents as a field named after its path, for example /person/filmography/produce/movie/title : Lucid (2006) (V)

The thing is, I'm looking to compute statistics on the frequency of a given term within the field instances of a document, for each document in the index (and then sum this value over all documents). So if there were two instances of /person/filmography/produce/movie/title and they both contained "Lucid", I'd want two. The tf(t in d) that Lucene gives would return 3 if another path also contained the term (e.g. /person/name: Lucid), but it doesn't distinguish terms that occur in repeated instances of the same field within a document.
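
In other words, what I want to compute looks roughly like the sketch below (Lucene 3.0 API; the field name and term are just examples, and the term is lowercased because StandardAnalyzer lowercases at index time):

    // Sketch: sum tf(t, d) over all documents for one specific path-field.
    // The field name and term are examples only.
    IndexReader reader = IndexReader.open(dir, true);
    Term term = new Term("/person/filmography/produce/movie/title0", "lucid");
    TermDocs termDocs = reader.termDocs(term);
    long totalTf = 0;
    while (termDocs.next()) {
        totalTf += termDocs.freq(); // occurrences of the term in this field of this doc
    }
    termDocs.close();
    reader.close();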

The heart of the Lucene indexing is this endElement handler:

    public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
        // Adds a field to the Lucene document: the field name is the unfolded currpathname
        // stack and the value is the element's whitespace-trimmed text content.
        if (this.ignoreTags.contains(localName)) {
            ignoredArea = false;
            return;
        }
        String newContent = content.toString().trim();
        if (!empty.equals(newContent) && newContent.length() > 1) {
            StringBuffer stb = new StringBuffer();
            for (int i = 0; i < currpathname.size(); i++) {
                stb.append("/");
                stb.append(currpathname.get(i));
            }
            // big counts how many times each path has been seen across all documents
            if (big.get(stb.toString()) == null) {
                big.put(stb.toString(), 1);
            } else {
                big.put(stb.toString(), big.get(stb.toString()) + 1);
            }
            // map counts occurrences of the path within the current document; the count is
            // appended to the field name so repeated instances become distinct fields
            // (e.g. /person/name0, /person/name1)
            if (map.get(stb.toString()) == null) {
                map.put(stb.toString(), 0);
                stb.append(map.get(stb.toString())); // appends the 0 suffix
            } else {
                map.put(stb.toString(), map.get(stb.toString()) + 1);
                stb.append(map.get(stb.toString()));
            }
            doc.add(new Field(stb.toString(), newContent, Field.Store.YES, Field.Index.ANALYZED));
            seenPaths.add(stb);
        }
        currpathname.pop();
        content.delete(0, content.length()); // clear buffered text for the next element
    }

map and big are HashMaps (don't worry about big, it's used for something else); map is instantiated whenever a new XML file (Document object) is started. There is an endDocument() method that adds the Document after all the startElement, endElement, and characters callbacks have fired (these are Xerces SAX parser methods):

    public void endDocument() throws SAXException {
        try {
            numIndexed++;
            writer.addDocument(doc);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Sorry for the long post, and thanks for your help! Also, I don't think the server is the problem. I ran the code on 4 million files at once and it ran out of heap memory even with -Xmx12000M -Xms12000M.

It's a powerful server, so it can definitely handle this...

Edit 3:

Hello again! Thanks, and you're right: Lucene probably wasn't made to do this. We're actually going to run other experiments, but I think I solved the problem with the help of your ideas and some others. First, I stopped storing norms for the fields, and that shrank the index's size many times over. I also played with the maxMergeDocs and RAM buffer settings and raised them, and the indexing improved greatly. I'll mark the question as answered with your help :) Thanks.
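
Concretely, the changes that helped were along these lines (a sketch of what I mean; the numbers are just what I experimented with, not tuned values):

    // Omitting norms avoids the per-field, per-document norm byte that was inflating the .nrm file.
    doc.add(new Field(stb.toString(), newContent,
            Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));

    // Larger buffers mean fewer flushes and merges during bulk indexing
    // (example values, not the ones I settled on).
    writer.setRAMBufferSizeMB(512);
    writer.setMaxMergeDocs(Integer.MAX_VALUE);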


1 Answer


Try indexing in batches. The code below should give you an idea of how to do it. I would also recommend checking out the latest edition of Lucene in Action.

Most likely you are overloading the garbage collector (assuming there are no memory leaks, which are hard to find), which will eventually give you an out-of-memory error.

    private static final int FETCH_SIZE = 100;
    private static final int BATCH_SIZE = 1000;

    //Scrollable results will avoid loading too many objects in memory
    query.setFetchSize(FETCH_SIZE);
    ScrollableResults scroll = query.scroll(ScrollMode.FORWARD_ONLY);
    int batch = 0;
    scroll.beforeFirst();
    while (scroll.next()) {
        batch++;

        index(scroll.get(0)); //index each element

        if (batch % BATCH_SIZE == 0) {
            //flushToIndexes(); //apply changes to the indexes
            //optimize();
            //clear(); //free memory once the batch has been processed
        }
    }
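
If you are on plain Lucene rather than Hibernate Search, the same idea applies: commit every N documents so buffered state doesn't pile up. A rough sketch (Lucene 3.x; the names are placeholders, not your actual code):

    private static final int BATCH_SIZE = 1000;
    private int pending = 0;

    // Sketch: call once per parsed document; writer and doc come from your SAX handler.
    void addToIndex(IndexWriter writer, Document doc) throws IOException {
        writer.addDocument(doc);
        if (++pending % BATCH_SIZE == 0) {
            writer.commit(); // flush buffered documents instead of letting them accumulate
        }
    }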