I have prototyped a search/browse application in C# using Lucene.Net. The source data is a single modest 5 MB XML file (containing about 900 "documents") that I index using Lucene. My searches are working fine and are plenty fast. For this application, browsing and viewing each "hit" document is important: the user can select a hit and see a complete view of that document (which usually fits inside half the screen), and I need the matching search terms highlighted in that view. I am using WPF and an MVVM approach. The document view is currently implemented with about a dozen ContentControls, six of which display searchable fields through a highlightConverter.
The performance of this document view was quite poor, so I added stopwatch timing to isolate the problem. The HighlightSearchTerms method in my model seems to be the culprit (about 100-600 ms to execute). If I short-circuit this method to just return the input text, the performance is fine.
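A minimal sketch of this kind of stopwatch measurement, for anyone who wants to reproduce the numbers. The helper name TimingSketch.Time is hypothetical and is not the exact code used in the application; it only assumes System.Diagnostics.Stopwatch from the base class library.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Hypothetical helper: runs func, writes the elapsed time to the
    // debug output, and returns whatever func returned.
    public static T Time<T>(string label, Func<T> func)
    {
        var stopwatch = Stopwatch.StartNew();
        T result = func();
        stopwatch.Stop();
        Debug.WriteLine($"{label}: {stopwatch.Elapsed.TotalMilliseconds:F1} ms");
        return result;
    }
}
```

A call might look like TimingSketch.Time("highlight", () => HighlightSearchTerms(text, queryString)), which leaves the return value untouched while logging how long the call took.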
Here is the HighlightSearchTerms method, preceded by the fields it uses:
_analyzer is a StandardAnalyzer(_luceneVersion)
_parser is a QueryParser(_luceneVersion, "content", _analyzer)
_formatter is a SimpleHTMLFormatter("|~S~|", "|~E~|")
private string HighlightSearchTerms(string text, string queryString)
{
    // The query is re-parsed on every call (once per highlighted field).
    var query = new BooleanQuery();
    query.Add(_parser.Parse(queryString), Occur.SHOULD);
    var fragmentScorer = new QueryScorer(query);
    var highlighter = new Highlighter(_formatter, fragmentScorer);
    // NullFragmenter treats the whole text as a single fragment.
    highlighter.TextFragmenter = new NullFragmenter();
    var tokenStream = _analyzer.TokenStream(null, new StringReader(text));
    string highlightedText = highlighter.GetBestFragment(tokenStream, text);
    // GetBestFragment can return null; fall back to the original text.
    return highlightedText ?? text;
}
A few years back I read the "Lucene In Action" book and have again thumbed through relevant portions to see if I could get any ideas. I've also searched the net a good bit. So, here are a couple of questions or areas of possible exploration.
- Can I omit scoring somehow? I don't need to show context of matching search terms, so I don't need to break up the hit document into fragments and get a "score" for the various fragments. I want the list of hits shown by title and then when the user selects one hit, the whole hit document is displayed with highlighting. I see how to use NullFragmenter and GetBestFragment, but I don't know whether that short-circuits the scoring operation. Would omitting the scoring improve performance?
- I have considered refactoring my view to have a single widget for displaying a hit document as one blob of HTML or RTF text. That way, I could call the highlight method only one time instead of 10 or 15 times (some ContentControls are inside an ItemsControl so there are multiple instances of some fields on the view). I expect this would significantly boost performance. The highlighting would be on text that was marked up with table formatting and such, but I suppose that would still work?
- Is there something else I am missing that makes my highlight method so slow? Half a second seems way too slow - like I am really messing up something basic.
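To make the second idea concrete without changing the view, here is a minimal sketch of a variant: join the field texts, call the highlight method once, and split the result back into fields. This is not a single HTML/RTF widget, only a way to get down to one call. BatchHighlightSketch, HighlightAll, and the separator are hypothetical names and choices; the sketch assumes the separator character never occurs in the field text and comes back from the highlighter unchanged, which I have not verified against Lucene.Net.

```csharp
using System;
using System.Collections.Generic;

static class BatchHighlightSketch
{
    // Assumption: this control character never appears in field text and
    // is copied through the highlighter as-is.
    private const char Separator = '\u0001';

    // highlight stands in for a method such as HighlightSearchTerms with
    // the query string already bound.
    public static IList<string> HighlightAll(
        IList<string> fields, Func<string, string> highlight)
    {
        if (fields.Count == 0)
            return new List<string>();

        // One highlight call over all fields instead of one call per field.
        string joined = string.Join(Separator.ToString(), fields);
        string highlighted = highlight(joined) ?? joined;

        string[] parts = highlighted.Split(Separator);
        if (parts.Length != fields.Count)
            throw new InvalidOperationException(
                "Separator was altered or appeared in the field text.");
        return parts;
    }
}
```

One caveat with this variant: a phrase query could in principle match across the boundary between two adjacent fields, so the separator may need to be something the analyzer treats as a larger gap.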