In a very simple case, I have three documents with filenames "Lark", "Larker", and "Larking" (no file extension). In solr, I index these three documents mapping the filename to a "title" field. When I do a search for "Lark" all three documents are returned (which is what I want) but they are all given the same score. I would prefer that "Lark" be scored the highest, as it is an exact match to my query, with the others coming behind.
<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I believe the reason they are getting the same score is because of the EdgeNGramFilterFactory
employed at index time. Each document gets indexed as "La", "Lar", "Lark" with two of the documents ("Larker" and "Larking") being indexed with some additional variations. So in effect each document is an exact match for the query "Lark." I would like some way of executing a query where the term "Lark" would return all three documents but with the document titled "Lark" being returned higher than the others.
Results of query debug:
<lst name="debug">
<str name="rawquerystring">Lark</str>
<str name="querystring">Lark</str>
<str name="parsedquery">text:lark</str>
<str name="parsedquery_toString">text:lark</str>
<lst name="explain">
<str name="543d6ee4cbb33c26bbcf288b/xxnullxx/543d6ef9cbb33c26bbcf2892">
2.7104912 = (MATCH) weight(text:lark in 0) [DefaultSimilarity], result of:
2.7104912 = fieldWeight in 0, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
3.8332133 = idf(docFreq=3, maxDocs=68)
0.5 = fieldNorm(doc=0)
</str>
<str name="543d6ee4cbb33c26bbcf288b/xxnullxx/543d6ef9cbb33c26bbcf2893">
2.7104912 = (MATCH) weight(text:lark in 1) [DefaultSimilarity], result of:
2.7104912 = fieldWeight in 1, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
3.8332133 = idf(docFreq=3, maxDocs=68)
0.5 = fieldNorm(doc=1)
</str>
<str name="543d6ee4cbb33c26bbcf288b/xxnullxx/543d6ef9cbb33c26bbcf2894">
2.7104912 = (MATCH) weight(text:lark in 2) [DefaultSimilarity], result of:
2.7104912 = fieldWeight in 2, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
3.8332133 = idf(docFreq=3, maxDocs=68)
0.5 = fieldNorm(doc=2)
</str>
fieldNorm
should be lowest forLarking
and highest forLark
, soLark
should be getting the highest score. Can you rerun your query withdebugQuery=on&wt=xml
and check what fieldNorm you are getting for each doc? – arunfieldNorm
is the same for all three. – Mike Nitchie