I am new to Lucene and am trying to use it to search log files/entries generated by SystemA.
Architecture
Each log entry (an XML document) arrives as a file in an INPUT directory. SystemA sends log entries to an MQ queue, which is polled by a small utility that picks up each message and creates a file in the INPUT directory.
WriteIndex.java (IndexWriter/Lucene) keeps checking whether a new file has arrived in the INPUT directory. If so, it takes the file, adds it to the index, and moves the file to the OUTPUT directory. As part of indexing, I put the filename, path, timestamp, and contents into the index. Note: I index the content and also put the whole content into the index as a StringField.
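The polling step described above can be sketched with the JDK's java.nio.file API alone. This is only a minimal sketch under assumptions: the class name InputDirectoryPoller, the method pollOnce, and the indexer callback are hypothetical stand-ins (the callback marks the place where the IndexWriter call would go), and it assumes the MQ utility has finished writing a file before it becomes visible in INPUT.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: one polling pass over the INPUT directory.
class InputDirectoryPoller {

    /**
     * Hands every regular file currently in inputDir to the indexer callback,
     * then moves it to outputDir. Returns the number of files processed.
     * Assumption: files in inputDir are already completely written.
     */
    static int pollOnce(Path inputDir, Path outputDir, Consumer<Path> indexer)
            throws IOException {
        // Snapshot the directory first so that moving files does not
        // interfere with the directory iteration.
        List<Path> batch = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path candidate : stream) {
                if (Files.isRegularFile(candidate)) {
                    batch.add(candidate);
                }
            }
        }

        int processed = 0;
        for (Path file : batch) {
            // Placeholder for the real indexing call (for example, building
            // a Lucene Document and passing it to an IndexWriter).
            indexer.accept(file);
            // Move only after indexing, so a failure leaves the file in INPUT.
            Files.move(file, outputDir.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            processed++;
        }
        return processed;
    }
}
```

Calling pollOnce in a loop with a short sleep between passes would reproduce the "keeps checking" behaviour. Because the move happens only after the callback returns, a file whose indexing throws stays in INPUT and is retried on the next pass.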
SearchIndex.java (SearcherManager/Lucene/refreshIfChanged) is created once. As part of its creation, I also start a new thread that checks every minute whether the index has changed. I acquire an IndexSearcher for every request. This is working fine.
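The once-a-minute check described above could be scheduled roughly as follows. This is a sketch only, not the actual SearchIndex.java: the class name PeriodicRefresher is hypothetical, and the refresh task is passed in as a plain Runnable so that the example stays free of Lucene dependencies. In the real code the task would presumably wrap a call such as SearcherManager.maybeRefresh(), which is an assumption here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a refresh task at a fixed delay on one daemon thread.
class PeriodicRefresher implements AutoCloseable {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "index-refresh");
                thread.setDaemon(true); // do not block JVM or servlet shutdown
                return thread;
            });

    PeriodicRefresher(Runnable refreshTask, long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                refreshTask.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs,
                // so swallow it here (real code should log it).
            }
        }, period, period, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

For the setup in this post, the refresher would be created with a period of 1 and TimeUnit.MINUTES, and closed when the servlet is destroyed.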
Everything has worked fine so far, but I am not sure what will happen in production. I have tested with a few hundred files, but in production I will receive about 500K log entries a day, which means 500K small files, each containing one XML document. WriteIndex.java will have to run non-stop to update the index whenever a new file arrives.
I have the following questions:
Has anyone done similar work? Are there any issues or best practices I should be aware of?
Do you see any problem with the index files generated for such a large number of XML files? Each XML file will be 2 KB at most. Remember that I index the content and also store the content as a string in the index, so that I can retrieve it from the index whenever a search finds a match.
I will expose SearchIndex.java as a servlet so that admins can search log entries from a web page. Do you see any issues with that?
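To put the production numbers above in perspective, here is a rough back-of-the-envelope calculation. It is only an estimate under assumptions: the class and method names are hypothetical, 2 KB is taken as 2048 bytes, entries are assumed to arrive evenly over the day, and the result is the raw XML volume, not the size of the Lucene index (which also holds index structures and will differ).

```java
// Hypothetical helper for estimating the daily load from the figures in this post.
class LoadEstimate {

    private static final double SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

    /** Average arrival rate, assuming entries are spread evenly over the day. */
    static double entriesPerSecond(long entriesPerDay) {
        return entriesPerDay / SECONDS_PER_DAY;
    }

    /** Upper bound on raw XML bytes per day if every entry has the maximum size. */
    static long maxRawBytesPerDay(long entriesPerDay, long maxBytesPerEntry) {
        return entriesPerDay * maxBytesPerEntry;
    }
}
```

With 500,000 entries a day, entriesPerSecond gives about 5.8 entries per second on average, and maxRawBytesPerDay(500_000, 2048) gives 1,024,000,000 bytes, so roughly 1 GB of raw XML per day at most.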
Please let me know if anyone needs anything more specific.
Thanks, Rohit Goyal