I am new to Lucene and am trying to use it to search log entries generated by another system, SystemA.

Architecture

  1. Each log entry (an XML document) arrives in an INPUT directory. SystemA sends log entries to an MQ queue; a small utility polls the queue, picks up each message and writes it as a file in the INPUT directory.

  2. WriteIndex.java (Lucene IndexWriter) keeps checking whether a new file has arrived in the INPUT directory. If so, it adds the file to the index and moves it to the OUTPUT directory. For each file I index the filename, path, timestamp and contents. Note: the content is indexed and also stored in full, as a StringField.

  3. SearchIndex.java (Lucene SearcherManager, refreshed when the index changes) creates the SearcherManager and, at the same time, starts a thread that checks every minute whether the index has changed. I acquire an IndexSearcher for every request. This works fine.
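For readers following along, step 3 can be sketched roughly as below. This is a minimal illustration, not the poster's actual code: it assumes Lucene 8 or later on the classpath, uses an in-memory `ByteBuffersDirectory` where a real setup would use `FSDirectory.open(...)` on the index path, and the class name, the `content` field name and the sample text are made up for the example. It shows the acquire/release pattern and that a committed entry only becomes visible after the manager is refreshed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearcherManagerSketch {

    // Acquire a searcher per request and always release it in a finally block.
    static int countHits(SearcherManager manager, String word) throws IOException {
        IndexSearcher searcher = manager.acquire();
        try {
            TermQuery query = new TermQuery(new Term("content", word));
            return searcher.search(query, 10).scoreDocs.length;
        } finally {
            manager.release(searcher);
        }
    }

    // Returns hit counts {before refresh, after refresh} for a newly committed entry.
    public static int[] run() {
        try (Directory dir = new ByteBuffersDirectory(); // assumption: in-memory for the demo
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.commit(); // empty first commit so that a reader can be opened
            SearcherManager manager = new SearcherManager(dir, null);
            try {
                Document doc = new Document();
                doc.add(new TextField("content", "connection timeout in SystemA",
                        Field.Store.YES));
                writer.addDocument(doc);
                writer.commit();

                int beforeRefresh = countHits(manager, "timeout");
                // The once-a-minute background thread would call maybeRefresh() here.
                manager.maybeRefreshBlocking();
                int afterRefresh = countHits(manager, "timeout");
                return new int[] {beforeRefresh, afterRefresh};
            } finally {
                manager.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a one-minute refresh interval, a new log entry can take up to a minute after its commit to show up in search results, which is usually acceptable for an admin log viewer.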

Everything has worked fine so far, but I am not sure what will happen in production. I have tested with a few hundred files, whereas in production I will receive about 500K log entries a day, which means 500K small files, each containing one XML document. WriteIndex.java will have to run non-stop and update the index whenever a new file arrives.
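One way to keep a non-stop writer cheap at this volume is to avoid committing once per file and instead commit once per batch, moving files to OUTPUT only after the commit has succeeded. The sketch below uses only the JDK; `InputDrainer`, `BatchIndexer` and `drainOnce` are hypothetical names for illustration, and a real `BatchIndexer` would presumably wrap `IndexWriter.addDocument` and `IndexWriter.commit`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class InputDrainer {

    /** Hypothetical hook; a real one would delegate to a Lucene IndexWriter. */
    public interface BatchIndexer {
        void index(Path file, String xml);
        void commit();
    }

    /**
     * Indexes up to maxBatch *.xml files from input, commits once,
     * then moves the indexed files to output. Returns the batch size.
     */
    public static int drainOnce(Path input, Path output, int maxBatch,
                                BatchIndexer indexer) {
        try {
            List<Path> batch = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(input, "*.xml")) {
                for (Path file : files) {
                    if (batch.size() == maxBatch) {
                        break;
                    }
                    String xml = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    indexer.index(file, xml);
                    batch.add(file);
                }
            }
            if (batch.isEmpty()) {
                return 0;
            }
            // One commit per batch instead of one per file.
            indexer.commit();
            // Move only after the commit, so a crash never loses an unindexed entry.
            for (Path file : batch) {
                Files.move(file, output.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            return batch.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would invoke `drainOnce` in a loop and sleep briefly whenever it returns 0. If the process dies between the commit and the move, the same files are indexed again on restart, so a real indexer might use `IndexWriter.updateDocument` keyed on the filename to stay idempotent.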

I have the following questions:

  1. Has anyone done similar work? Are there any issues or best practices I should be aware of?

  2. Do you see any problem with the index files generated for such a large number of XML files? Each XML file is 2 KB at most. Remember that I index the content and also store it as a string in the index, so that I can retrieve it from the index whenever a search finds a match.

  3. I plan to expose SearchIndex.java as a servlet so that admins can search log entries from a web page. Do you see any issues with that?

Please let me know if anyone needs anything more specific.

Thanks, Rohit Goyal

There are three complex questions in one question here. It's very difficult to write a useful answer for something like that. - femtoRgon
I can share some more information if you require. - Rohit Goyal
One piece of code I am looking for is how to merge segments. Each time a new file is added and indexed, I can see new segment files and other files being created in the index directory. Can I merge the segments somehow? - Rohit Goyal
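On the segment question in the last comment: Lucene merges segments in the background on its own according to the writer's merge policy, so an explicit merge is normally not required. `IndexWriter.forceMerge(int)` does exist for merging down on demand, but it is an expensive operation. A minimal sketch, assuming Lucene 8 or later and an in-memory directory (the class and field names are illustrative only):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MergeSketch {

    // Number of segments visible in the latest commit.
    static int segmentCount(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.leaves().size();
        }
    }

    // Returns {segments before forceMerge, segments after forceMerge}.
    public static int[] run() {
        try (Directory dir = new ByteBuffersDirectory(); // assumption: in-memory for the demo
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 0; i < 5; i++) {
                Document doc = new Document();
                doc.add(new TextField("content", "log entry " + i, Field.Store.NO));
                writer.addDocument(doc);
                // Committing after every document flushes a new small segment each time.
                writer.commit();
            }
            int before = segmentCount(dir);

            writer.forceMerge(1); // blocks until the index is merged down to one segment
            writer.commit();
            int after = segmentCount(dir);

            return new int[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Committing less often (for example once per batch of files rather than once per file) reduces the number of tiny segments in the first place and is usually a better lever than calling `forceMerge` regularly.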

1 Answer

The architecture looks fine.

A few points:

  • Consider using TextField instead of StringField for the content. A TextField is tokenized, so users can search on individual tokens. A StringField is not tokenized: the whole value is indexed as a single term, so a document only matches when the query equals the full text.
  • Performance should not be a problem for Lucene. Check out the Lucene nightly performance graphs: the benchmark indexes tens of millions of Wikipedia documents in minutes, so 500K entries a day of at most 2 KB each is a small load by comparison. Searching is fast too.
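To make the first point concrete, here is a small sketch that indexes the same XML string both ways and runs a term query against each field. It assumes Lucene 8 or later; the class name, the field names `content` and `contentExact`, and the sample XML are invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FieldTypeDemo {

    static final String XML =
            "<log><level>ERROR</level><msg>Connection timeout</msg></log>";

    static int hits(IndexSearcher searcher, String field, String term) throws IOException {
        return searcher.search(new TermQuery(new Term(field, term)), 10).scoreDocs.length;
    }

    // Returns hit counts for:
    // {token on TextField, token on StringField, whole value on StringField}.
    public static int[] run() {
        try (Directory dir = new ByteBuffersDirectory(); // assumption: in-memory for the demo
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Tokenized and lowercased by the analyzer; stored so it can be shown in results.
            doc.add(new TextField("content", XML, Field.Store.YES));
            // Indexed verbatim as one single term.
            doc.add(new StringField("contentExact", XML, Field.Store.NO));
            writer.addDocument(doc);
            writer.commit();

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                return new int[] {
                    hits(searcher, "content", "timeout"),
                    hits(searcher, "contentExact", "timeout"),
                    hits(searcher, "contentExact", XML)
                };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The word `timeout` only matches on the tokenized `content` field; the `contentExact` field matches only when the query term is the entire XML string. Note that the term is written in lowercase because `StandardAnalyzer` lowercases tokens at index time and `TermQuery` does not analyze its input.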