Solr/Lucene Scorer

11

votes

We are currently working on a proof-of-concept for a client using Solr and have been able to configure all the features they want except the scoring.

The problem is that they want scores that make results fall into buckets:

Bucket 1: exact match on category (score = 4)
Bucket 2: exact match on name (score = 3)
Bucket 3: partial match on category (score = 2)
Bucket 4: partial match on name (score = 1)

The first thing we did was develop a custom similarity class that returns the correct score depending on the field and on whether the match is exact or partial.

The only problem now is that when a document matches on both the category and the name, the scores are added together.

Example: searching for "restaurant" returns documents in the category restaurant that also have the word restaurant in their name. These get a score of 5 (4+1), but they should only get 4.

I assume that for this to work we would need to develop a custom Scorer class, but we have no clue how to incorporate this in Solr. Another option is to create a custom SortField implementation similar to the RandomSortField already present in Solr.

Maybe there is even a simpler solution that we don't know about.

All suggestions welcome!

lucene solr

3

votes

Scorers are part of Lucene queries and are reached through the query's 'weight' method.

In short, the framework calls Query.weight(..).scorer(..). Have a look at

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Query.html

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Weight.html

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Scorer.html

To use your own Query class in Solr, you'll need to implement your own Solr QParserPlugin that returns your own QParser, which in turn generates your previously implemented Lucene Query. You can then use it in Solr as specified here:

http://wiki.apache.org/solr/SolrPlugins#QParserPlugin

This part of the implementation should stay simple, as it is just glue code.

Enjoy hacking Solr!

3

votes

You can override the logic that Solr uses for scoring. Solr uses the DefaultSimilarity class for scoring.

Make a class that extends DefaultSimilarity and override methods such as tf() and idf() according to your needs:

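// Package as of Lucene releases before 4.0, which match the signatures below.
import org.apache.lucene.search.DefaultSimilarity;

public class CustomSimilarity extends DefaultSimilarity {

  public CustomSimilarity() {
    super();
  }

  // Term frequency factor: returning a constant ignores how often the term occurs.
  public float tf(int freq) {
    //your code
    return 1.0f;
  }

  // Inverse document frequency factor: returning a constant ignores term rarity.
  public float idf(int docFreq, int numDocs) {
    //your code
    return 1.0f;
  }

}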

After creating the class, compile it and build a jar.

Put the jar in the lib folder of the corresponding index or core.

Change the schema.xml of corresponding index: <similarity class="<your package name>.CustomSimilarity"/>

You can check out various factors affecting score here

For your requirement, you can assign a result to a bucket when its score falls within a specific range. Also read about field boosting and document boosting; they might be helpful in your case.

2

votes

I believe that Solr's DisMaxRequestHandler can do the trick for you.

See hossman's explanation of dismax and Mark Miller's survey of query parsers.
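DisMax is relevant here because a disjunction-max query scores a document by the maximum of its per-field subquery scores, plus an optional tie-breaker fraction of the remaining scores, instead of summing them. The following is only a plain-Java sketch of that combination rule with no Lucene or Solr dependency; the class name BucketScores and its methods are hypothetical, and the inputs 4 and 1 are the bucket values from the question. In a real setup the per-field scores would still come from your similarity and boosts.

```java
/** Hypothetical illustration of sum-style vs. DisMax-style score combination. */
final class BucketScores {

    /** Sum-style combination: matches on several fields add up. */
    public static float sum(float... fieldScores) {
        float total = 0f;
        for (float s : fieldScores) {
            total += s;
        }
        return total;
    }

    /** DisMax-style combination: best field score plus tie * (sum of the other scores). */
    public static float disMax(float tie, float... fieldScores) {
        float max = 0f;
        float total = 0f;
        for (float s : fieldScores) {
            max = Math.max(max, s);
            total += s;
        }
        return max + tie * (total - max);
    }

    public static void main(String[] args) {
        // Exact category match (4) plus partial name match (1), as in the question.
        System.out.println(sum(4f, 1f));        // prints 5.0
        System.out.println(disMax(0f, 4f, 1f)); // prints 4.0
    }
}
```

With a tie-breaker of 0, the document that matches on both fields stays in bucket 4, which is the behaviour the question asks for.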

2

votes

Thanks for the nice answers above. Just adding to them, after setting this up in Solr 4.2.1, which allows per-field similarity. (Prior to Solr 4, you could only alter the similarity for all fields globally.)

Let's say we want Solr to not use inverse document frequency (idf) for a specific field. We should write our own custom Similarity for this, as mentioned above:

package com.mycompany.similarity;

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoIDFSimilarity extends DefaultSimilarity
{
    @Override
    public float idf(long docFreq, long numDocs)
    {
        return 1.0f;
    }

    @Override
    public String toString()
    {
        return "NoIDFSimilarity";
    }
}

and then in our schema.xml define a new fieldType like this:

<fieldType name="int_no_idf" 
           class="solr.TrieIntField" 
           precisionStep="0" 
           positionIncrementGap="0" 
           omitNorms="true">
    <similarity class="com.mycompany.similarity.NoIDFSimilarity"/>
</fieldType>

and use it on a field like this:

<field name="tag_id_no_idf" 
       type="int_no_idf" 
       indexed="true" 
       stored="false" 
       multiValued="true" />

If you do only this much, you will get the following exception:

SEVERE: Unable to create core: SimilarList
org.apache.solr.common.SolrException: FieldType 'int_no_idf' is configured with a similarity, but the global similarity does not support it: class org.apache.solr.search.similarities.DefaultSimilarityFactory
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:466)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:122)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:1018)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1051)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Apr 25, 2013 5:02:08 PM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException: Unable to create core: SimilarList
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1672)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1057)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: FieldType 'int_no_idf' is configured with a similarity, but the global similarity does not support it: class org.apache.solr.search.similarities.DefaultSimilarityFactory
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:466)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:122)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:1018)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1051)
    ... 10 more

A Google search leads you to this, so simply add this line to your schema.xml; it will be applied to the rest of the fields:

<similarity class="solr.SchemaSimilarityFactory"/>

(From that link: But keep in mind that coord and queryNorm (=1.0f) are not implemented now, so you will get different scores for TF-IDF!)
