
Given a series of documents containing text, I'd like to search for phrases, return all the matches, and rank them. I know how to get Lucene/Solr to indicate which documents match and to highlight within a document, but how do I get a ranking that includes multiple matches from the same document?

First document.  It has a single line of text.
Second document.  This text line is quite short.
This is another line containing more text and is a bit longer.

If I searched for "text line", then I'd like it to find three matches, ranked as follows:

2nd document -> ...This "text line" is quite short.
1st document -> ...It has a single "line of text".
2nd document -> ...another "line containing more text" and is...

Is this possible? How?

I originally had a more complicated question, which included this, here: stackoverflow.com/questions/8883390/… - Chris Leishman
Why do you want document 2 twice in the results? Maybe you should index each line as a document... - naresh
That's what I said: every line as a document if you want matches to be lines. - milan
I want document 2 in the results twice, because it has two different matches that have different rankings. But I can't separate each line, because my source files are a stream of text, and a search for a phrase must match across newline boundaries. - Chris Leishman

1 Answer


If you want to have one match per line, then make each line its own document. Don't confuse the term "document" with a file: a document is just the unit you index and get back as a result, and it doesn't have to correspond to a single file.

If you want to maintain a link back to the file, index the file's id as well, in a separate (stored) field.

{ "id": "myfile.txt",
  "text": "first line" }

{ "id": "myfile.txt",
  "text": "second line" }
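
To make the line-per-document idea concrete, here is a minimal sketch in plain Python that splits one file's text into per-line documents ready to be sent to an indexer. It is only an illustration under assumptions: the function name `lines_to_docs`, the `file_id` field, and the `<file>#<line>` key format are hypothetical and not part of the answer above. The per-line `id` is there because, if `id` is configured as the schema's unique key, documents sharing the same value would overwrite one another.

```python
import json


def lines_to_docs(file_id, text):
    """Turn one file's text into one index document per non-empty line."""
    docs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines; there is nothing to search in them
        docs.append({
            "id": f"{file_id}#{line_no}",  # hypothetical per-line unique key
            "file_id": file_id,            # stored link back to the source file
            "text": line,                  # the searchable line itself
        })
    return docs


if __name__ == "__main__":
    sample = (
        "Second document.  This text line is quite short.\n"
        "This is another line containing more text and is a bit longer."
    )
    # Print the documents as JSON, one object per line of the input.
    print(json.dumps(lines_to_docs("doc2.txt", sample), indent=2))
```

Each returned dictionary can be indexed as its own document, so every matching line is scored and ranked separately while `file_id` still identifies the file it came from. As the asker's comment points out, a phrase that spans a newline would not match under this layout.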