I'm indexing in Lucene and am only interested in getting the IDs of relevant documents out of it (i.e., no field values and no highlighting information). Given these requirements, which term vector option should I use without hurting search performance (speed) or quality (results)? I will also be using MoreLikeThis so don't want

TermVector.YES—Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information

TermVector.WITH_POSITIONS—Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets

TermVector.WITH_OFFSETS—Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions

TermVector.WITH_POSITIONS_OFFSETS—Stores unique terms and their counts, along with positions and offsets
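
To make the difference between the four options concrete, here is a minimal plain-Java sketch (not Lucene code) that computes, for a whitespace-tokenized string, the three kinds of data the options describe: term counts, token positions, and character offsets. The class `TermVectorSketch` and its methods are hypothetical names made up for illustration, and the whitespace tokenizer is only a naive stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: mimics the kinds of data a term vector can hold.
public class TermVectorSketch {
    // Assumption: a token is any run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // The data TermVector.YES describes: unique terms and their counts.
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.merge(m.group(), 1, Integer::sum);
        }
        return result;
    }

    // The extra data WITH_POSITIONS describes: the token index of each occurrence.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(text);
        int position = 0;
        while (m.find()) {
            result.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(position);
            position++;
        }
        return result;
    }

    // The extra data WITH_OFFSETS describes: {start, end} character offsets (end exclusive).
    static Map<String, List<int[]>> offsets(String text) {
        Map<String, List<int[]>> result = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.computeIfAbsent(m.group(), k -> new ArrayList<>())
                  .add(new int[] {m.start(), m.end()});
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "blah date blah ID blah";
        System.out.println(counts(text));
        System.out.println(positions(text));
    }
}
```

WITH_POSITIONS_OFFSETS corresponds to keeping the results of both `positions` and `offsets` alongside the counts, which is why it is the most expensive of the four in storage.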

Thanks.

Do you want the internal Lucene doc number, or some ID that you store in it? - Jf Beaulac

1 Answer

It depends on the type of your queries. If you have any related data alongside your IDs, then you will want to have positions and/or offsets.

If you have a document like this: "blah blah blah date blah ID blah name blah"

and you just want to find that specific ID, then TermVector.YES is fine. However, if you want to find the ID based on how close it is to a date or a name (with advanced queries), you will need the additional term positions.
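
As a rough illustration of why the "how close is it" case needs positions, here is a small plain-Java sketch run against the example document above. It is not how Lucene evaluates queries; `ProximitySketch`, `positionsOf`, and `within` are hypothetical names, and tokens are simply split on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a "within N tokens" check built on token positions.
public class ProximitySketch {
    // Token positions (0-based) at which `term` occurs in a whitespace-tokenized text.
    static List<Integer> positionsOf(String text, String term) {
        List<Integer> result = new ArrayList<>();
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // True if some occurrence of `a` is at most `maxDistance` tokens away from some occurrence of `b`.
    static boolean within(String text, String a, String b, int maxDistance) {
        for (int pa : positionsOf(text, a)) {
            for (int pb : positionsOf(text, b)) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "blah blah blah date blah ID blah name blah";
        System.out.println(within(doc, "ID", "date", 2));
        System.out.println(within(doc, "ID", "date", 1));
    }
}
```

With only terms and counts you could tell that "ID" and "date" both occur in the document, but not how far apart they are, so a check like `within` would have nothing to work from.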

You can always try this out, and it is an easy change, assuming you do not have to unit test a billion-record index or something :)

BTW, check out the book "Lucene in Action"; it covers all of this information.