1
votes

I have documents which I am indexing with Lucene. These documents basically have a title (text) and body (text). Currently I am creating an index out of Lucene Documents with (amongst other fields) a single searchable field, which is basically title+" "+body. In this way, if you search for anything which occurs in the title or in the body, you will find the document.

However, now I have learned of the new requirement that matches in the title should cause the document to be "more relevant" than matches in the body. Thus, if there is a document with the title "Software design", and the user searches for "Software design", then that document should be placed higher up in the search results than a document called something else, which mentions software design a lot in the body.

I don't really have any idea how to begin implementing this requirement. I know that Google, for example, treats certain parts of a document as "more relevant" (e.g. text within <h1> tags), and everyone here assumes Lucene supports something similar.

However,

  • The Javadoc for the Document class clearly states that fields contain text, i.e. not structured text where some parts are "more important" than other parts.
  • This blog post states "With Lucene, it is impossible to increase or decrease the weight of individual terms in a document."

I'm not really sure where to look. What would you suggest?

Any specific information (e.g. links to Lucene documentation) stating flatly that such a thing is not possible would also be helpful, because then I needn't spend any further time looking for how to do it. (The software is already written with Lucene and we won't rewrite it now, so if Lucene doesn't support this, there's nothing anyone (my boss) can do about it.)

2
I think you're talking about boosting fields, not terms. - Xodarap

2 Answers

3
votes

Just use two fields, title and body, and boost the 'title' field while indexing:

title.setBoost(float)

see here
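
To see why splitting the text into two fields and weighting the title changes the ordering, here is a toy model in plain Java. It is only a sketch: the class ToyFieldBoost and its weighted term-count score are made up for illustration and are not Lucene's scoring formula or API. It just shows the effect of multiplying title matches by a boost factor.

```java
// Toy illustration only: NOT Lucene's scoring, just a weighted term count.
final class ToyFieldBoost {

    // Counts how often a term occurs in a text (case-insensitive, split on non-word characters).
    static int count(String text, String term) {
        String wanted = term.toLowerCase(java.util.Locale.ROOT);
        int n = 0;
        for (String token : text.toLowerCase(java.util.Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) {
                n++;
            }
        }
        return n;
    }

    // Score = titleBoost * (matches in title) + (matches in body), summed over the query terms.
    static float score(String title, String body, String[] queryTerms, float titleBoost) {
        float s = 0f;
        for (String term : queryTerms) {
            s += titleBoost * count(title, term) + count(body, term);
        }
        return s;
    }

    public static void main(String[] args) {
        String[] query = {"software", "design"};
        String bodyHeavy = "software design, software design, software design";
        for (float boost : new float[] {1f, 20f}) {
            float titled = score("Software design", "An introduction.", query, boost);
            float other = score("Meeting notes", bodyHeavy, query, boost);
            // With boost 1 the body-heavy document scores higher; with boost 20 the title match does.
            System.out.println("boost=" + boost + " titled=" + titled + " other=" + other);
        }
    }
}
```

With a single combined field there is no way to tell the two kinds of match apart, which is why the split into title and body has to come first.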

1
votes

You should probably split the combined field into separate title and body fields, then use a run-time (query-time) boost to give matches in the title field more relevance.

The run-time query will look like this:

title:apache^20 body:apache

see - http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Boosting%20a%20Term
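
If the search text comes from a user, a query string like the one above can be assembled in code. Below is a minimal sketch in plain Java. The class BoostedQuery and its methods are hypothetical helpers, not part of Lucene; the field names title and body are taken from the answer, and the boost value is whatever you choose. The text is wrapped in quotes so that multi-word input such as "Software design" is treated as a phrase, and the escaping of backslashes and double quotes is my assumption about what the classic query parser needs inside a phrase.

```java
// Hypothetical helper (not a Lucene class): builds a query string in the
// classic query-parser syntax that searches both fields and boosts the title.
final class BoostedQuery {

    // Wraps the user's text in double quotes so it is parsed as a phrase,
    // escaping backslashes and double quotes so they cannot end the phrase early.
    static String phrase(String userText) {
        String escaped = userText.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Produces: title:"<text>"^<boost> body:"<text>"
    static String build(String userText, float titleBoost) {
        String p = phrase(userText);
        return "title:" + p + "^" + titleBoost + " body:" + p;
    }

    public static void main(String[] args) {
        // Prints the query string that would be handed to the query parser.
        System.out.println(build("Software design", 20f));
    }
}
```

The resulting string would then be passed to the query parser in the usual way. The boost of 20 is only the value from the example above, so expect to tune it against real queries.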