1
votes

Is it possible to boost a document at index time depending on a field value?

I'm indexing a text field pulled from the database. I would like shorter results to be boosted over longer ones, so the boost value should depend on the length of the text field.

This is needed to alter the standard Solr behavior, which in my case tends to return documents with multiple matches first.

Assuming I have a field that stores the length of the document, the query-time equivalent of what I need at index time would be:

q={!boost b=sqrt(length)}text:abcd

Example: I have two items in the DB:

ABCDEBCE
ABCD

I always want to get ABCD first for the 'BC' query, even though the other item contains the search term twice.

Another solution would be the ability to switch off, at query time, the feature that scores multiple matches higher. I don't know if that is possible either.

Doing this at index time is important because the hardware I run Solr on is not very powerful, and boosting at query time fails with an OutOfMemory exception. (Even if I could work around that by increasing the Java heap, I prefer to be on the safe side and build the index in the most efficient way possible.)


2 Answers

0
votes

Yes and no; how you do it depends on how you're indexing your documents.

As far as I know, there's currently no way to do this purely on the Solr server side.

If you're using the regular XML-based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document, depending on the length of the text field.
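
As a rough illustration of that idea, here is a minimal sketch of client-side code that builds the XML for one document and sets the document-level boost attribute from the text length. The class name, the field names (id, text) and the 1/sqrt(length) curve are all assumptions made for the example, not something taken from the question; pick whatever curve suits your data. It also assumes a Solr version that still honors index-time boosts, since newer releases have dropped them.

```java
import java.util.Locale;

// Hypothetical helper: builds a Solr XML <add> message with a length-based boost.
class BoostedDocXml {

    // Assumed boost curve: the shorter the text, the larger the boost.
    static double boostFor(String text) {
        return 1.0 / Math.sqrt(Math.max(1, text.length()));
    }

    // Minimal XML escaping for element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Field names "id" and "text" are placeholders for your own schema.
    static String toAddXml(String id, String text) {
        return String.format(Locale.ROOT,
                "<add><doc boost=\"%.4f\">"
                        + "<field name=\"id\">%s</field>"
                        + "<field name=\"text\">%s</field>"
                        + "</doc></add>",
                boostFor(text), escape(id), escape(text));
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("1", "ABCDEBCE"));
        System.out.println(toAddXml("2", "ABCD"));
    }
}
```

With this curve ABCD gets a larger boost than ABCDEBCE. The resulting string is what you would POST to the update handler in place of the unboosted document.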

0
votes

You can look at the DataImportHandler (DIH) special commands, which include a $docBoost command:

$docBoost : Boost the current doc. The value can be a number or the toString of a number

However, there seems to be no $fieldBoost command.

For your case, though, if you are using DefaultSimilarity, shorter fields are already boosted higher than longer fields in the score calculation.
You can also implement your own Similarity class with the TF (term frequency) and lengthNorm calculations changed to suit your needs.
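
To see how those two factors interact, here is a small plain-Java sketch (no Lucene dependency) that mimics the shape of the DefaultSimilarity formulas, tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms), and compares them with a "flat" tf that ignores repeated matches. It assumes the field is tokenized into character bigrams, which is only a guess at how the 'BC' query matches inside ABCDEBCE. It also ignores idf, query norms and the lossy byte encoding of norms, so the numbers are illustrative only.

```java
// Illustrative arithmetic only; class and method names are made up for this sketch.
class TfLengthNormSketch {

    // Same shape as DefaultSimilarity: repeated matches help, with diminishing returns.
    static double defaultTf(int freq) {
        return Math.sqrt(freq);
    }

    // What a custom Similarity could return to ignore repeated matches.
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    // Same shape as DefaultSimilarity: longer fields get a smaller norm.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Counts (possibly overlapping) occurrences of a bigram in the text.
    static int countBigram(String text, String bigram) {
        int count = 0;
        for (int i = 0; i + bigram.length() <= text.length(); i++) {
            if (text.startsWith(bigram, i)) {
                count++;
            }
        }
        return count;
    }

    // tf * lengthNorm for one document, assuming bigram tokenization.
    static double weight(String text, String bigram, boolean flat) {
        int freq = countBigram(text, bigram);
        int numTerms = Math.max(1, text.length() - 1); // number of bigrams
        double tf = flat ? flatTf(freq) : defaultTf(freq);
        return tf * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        for (String doc : new String[] {"ABCDEBCE", "ABCD"}) {
            System.out.printf("%s default=%.3f flat=%.3f%n",
                    doc, weight(doc, "BC", false), weight(doc, "BC", true));
        }
    }
}
```

Under these assumptions the shorter document comes out ahead in both variants, and the flat tf widens the gap. In a real custom Similarity you would override the corresponding tf and length-norm methods, whose exact signatures depend on your Lucene version.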