
I'm using Lucene.Net to implement a search website (to search PDFs). Once the keyword is entered, I display the results, and when one of the result items is clicked, I want to take the user to a "details" page, where I want to display snippets from that PDF document everywhere the keyword is found.

So my question is, what's the best way to gather these snippets from that document?

  1. Do I just take the selected item id, re-query on just that document, and let Lucene's highlighter give me the collection of snippets?

  2. Or, since I already have the text content for each result record, would it be better to manually process the snippets using C# string manipulation?

If it is 1., could you please point me to an example of how to write a query to search a single document in Lucene?

Thanks.

1 Answer

You should probably use the Lucene Highlighter package, because the query and the document text need to be tokenized with the same analyzer that was used to index the document. Plain C# string manipulation can work, but you would have to reproduce the same analysis logic (stemming, stop words, etc.) to match the query terms against the document text.

If you store the full text of the document in the index, using the highlighter is simple. If you don't store the text in the index, you can fetch it from somewhere else.

To highlight a single document, pass the same query used in the initial search plus an exact match on the document you want, for example by appending a required clause on that document's unique ID. The single-document query therefore has two required clauses: the original query that found the document, and a unique identifier for that document. That way the highlighter can use the same query to generate the highlighted snippets.
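A rough sketch of that approach, assuming Lucene.Net 3.0.x with the contrib Highlighter assembly referenced. The field names `id` and `content`, the `SnippetHelper` class, and the `GetSnippets` method are hypothetical names for illustration; it also assumes the ID field was indexed un-analyzed (so a `TermQuery` matches it exactly) and that the full text is stored in the index. Type and member names may differ in other Lucene.Net versions.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;

public static class SnippetHelper
{
    // Hypothetical field names; substitute whatever your index uses.
    private const string IdField = "id";
    private const string ContentField = "content";

    public static string[] GetSnippets(
        IndexSearcher searcher,
        Analyzer analyzer,        // must be the analyzer used at index time
        Query originalQuery,      // the query from the initial search
        string documentId,
        int maxSnippets)
    {
        // Two required clauses: the original query plus an exact match on the ID.
        var singleDocQuery = new BooleanQuery();
        singleDocQuery.Add(originalQuery, Occur.MUST);
        singleDocQuery.Add(new TermQuery(new Term(IdField, documentId)), Occur.MUST);

        // At most one document can match, so ask for a single hit.
        TopDocs hits = searcher.Search(singleDocQuery, 1);
        if (hits.ScoreDocs.Length == 0)
            return new string[0];

        // Assumes the text is stored in the index; otherwise load it from elsewhere.
        Document doc = searcher.Doc(hits.ScoreDocs[0].Doc);
        string text = doc.Get(ContentField);
        if (string.IsNullOrEmpty(text))
            return new string[0];

        // Score fragments against the same query, restricted to the content field
        // so the ID clause does not take part in highlighting.
        var scorer = new QueryScorer(singleDocQuery, ContentField);
        var highlighter = new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), scorer);
        highlighter.TextFragmenter = new SimpleFragmenter(150); // roughly 150 chars per snippet

        return highlighter.GetBestFragments(analyzer, ContentField, text, maxSnippets);
    }
}
```

`GetBestFragments` returns at most `maxSnippets` fragments, so to show every place the keyword occurs on the details page you would pass a generously large value.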