Lucene: how can I find query hit positions in original contents?

Question

Suppose I have a document collection that I have indexed in Lucene. I submit a query and get hits. Now what I want is to find where in a particular document hit(s) occur(s). I know that I can use the Lucene Highlighting classes to obtain relevant fragments. But how can I find out where exactly these fragments appear in the original contents?

A related question: how can I make sure the fragments found actually match the original query closely? In my experiments with highlighting, a multi-word query often returned fragments containing only some of the query words. What if I want only hits that contain all the words?

Thanks!

I Z, I have noticed that when you ask a question, you don't always accept or upvote an answer. Well, actually you have never done that. You see, this deters SO members from answering your questions. I'm not all about reputation points, but it would be nice to know whether my answer helped or not. So please follow up on your questions, or at least add a comment. Thanks - bpgergo

bpgergo · Accepted Answer · 2011-12-07T19:53:43

Not an actual answer, just a few links to a solution to a similar problem.

First of all, here you can see the actual results of the highlighting (note that "were" is highlighted even though "am" was in the query; stemming is an additional feature of this implementation): http://hunglish.hu/search?huSentence=&enSentence=I%20am%20highlighted&size=20&page=2&doc.genre=-10

Here's the source. Look for the methods highlightField and highlightBisen: http://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/Searcher.java

Disclaimer: I wrote this a while ago, it is not very nice code, and it is buggy in special cases: there is an open issue relating to highlighting. Furthermore, it uses version 3.2.0 of the lucene-highlighter, which is possibly not the newest.

Anyway, I hope if you look at how it works, it helps you write a better one, or at least something that works as expected.
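
To make the two questions concrete, here is a minimal sketch in plain Java. It is not the Hunglish code and it does not call any Lucene API. It assumes the highlighter returns each fragment as a verbatim substring of the original text with only a pre-tag and post-tag (for example `<B>` and `</B>`) wrapped around matched terms. The class name `FragmentLocator` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or of the linked Searcher.java.
class FragmentLocator {

    // Remove the highlighter markup so the fragment can be compared with the original text.
    static String stripTags(String fragment, String preTag, String postTag) {
        return fragment.replace(preTag, "").replace(postTag, "");
    }

    // Return every start offset at which the un-tagged fragment occurs in the original text.
    // Assumption: the fragment is a verbatim substring of the original apart from the tags.
    static List<Integer> findOffsets(String original, String highlightedFragment,
                                     String preTag, String postTag) {
        String plain = stripTags(highlightedFragment, preTag, postTag);
        List<Integer> offsets = new ArrayList<>();
        if (plain.isEmpty()) {
            return offsets;
        }
        int from = 0;
        int idx;
        while ((idx = original.indexOf(plain, from)) >= 0) {
            offsets.add(idx);
            from = idx + 1; // keep scanning so repeated occurrences are reported too
        }
        return offsets;
    }

    // Split text into lower-cased words on anything that is not a letter or digit.
    static Set<String> words(String text) {
        String[] parts = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Set<String> result = new HashSet<>(Arrays.asList(parts));
        result.remove(""); // split() can yield a leading empty string
        return result;
    }

    // True only if every query word appears as a whole word in the text.
    // Exact word matching only: no stemming, so "am" does not match "were".
    static boolean containsAllTerms(String text, String query) {
        return words(text).containsAll(words(query));
    }
}
```

Calling `findOffsets(original, fragment, "<B>", "</B>")` gives candidate positions of a fragment in the original contents, and `containsAllTerms(stripTags(fragment, "<B>", "</B>"), query)` can be used to discard fragments that contain only some of the query words. The approach breaks down if the analyzer or encoder alters the text, or if several fragments are joined with a separator. If the index stores term vectors with offsets, those offsets are probably a more robust source of positions than string matching.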
