3
votes

I am having an issue with Solr results and thought I'd ask for suggestions here.

I have enabled phonetic matching by including <filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="true"/> at both index and query time. I also tried the DoubleMetaphone encoder as a variation.

The issue is that Solr returns only phonetically matched results and disregards wildcard matches and near-exact matches of the search phrase.

Example:

In my index, I have a document with a field called 'name' and the value 'Modenine'. When I search for name:mod, I get 'Modenine', which is OK.

But when I search for name:mode (note the extra 'e'), it returns 'Something Foul Mouth', because 'mouth' phonetically matches 'mode'. I don't mind having 'Something Foul Mouth' as a result, but I also want to see 'Modenine', since 'mode' is the actual search term.

The quickest solution that comes to mind is to add the phonetic code to a separate field during indexing, then use dismax to rank the results by boosting the exact field, using ^2.0 for example.
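A rough, untested sketch of that idea might look like the following. It assumes the 'phonetics' field type already exists; the 'text_exact' type and the '/select' handler defaults are illustrative names chosen for this example, not part of my current setup.

```xml
<!-- schema.xml (sketch): keep exact and phonetic terms in separate fields. -->
<!-- "text_exact" is a hypothetical non-phonetic field type. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<field name="name" type="text_exact" indexed="true" stored="true"/>
<!-- Stored values are always the original text; phonetic codes exist only
     as indexed terms, so there is little point in storing this copy. -->
<field name="phoneticName" type="phonetics" indexed="true" stored="false"/>
<copyField source="name" dest="phoneticName"/>

<!-- solrconfig.xml (sketch): weight exact matches above phonetic ones. -->
<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="qf">name^2.0 phoneticName</str>
    </lst>
</requestHandler>
```

With this, a document matching on 'name' should score higher than one matching only on 'phoneticName'. It does not by itself make 'mode' match 'Modenine'; that still needs a prefix or n-gram match on the exact field.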

I have the following field declarations:

<field name="phoneticName" type="phonetics" indexed="true" stored="true"/>
<field name="name" type="phonetics" indexed="true" stored="true"/> 

FieldType for phonetics

<fieldType name="phonetics" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="true"/>
    </analyzer>
    <analyzer type="query">             
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>        
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex" inject="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />       
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>               
    </analyzer>
</fieldType>

But after re-indexing, the phoneticName field contains only the exact value of the name field; it doesn't store the phonetic code that I want to search by.

I found this question, solr-boosting-down-phonetic-variations, but it doesn't have much detail.

Thanks P

2 Answers

3
votes

I finally got it to work: when I enter mod as the query, I get about 5 related results, including Modenine. I managed this with the NGram filter, which is not something I only just discovered. In fact, I've had an NGram filter in the list of filters in schema.xml from the start, but it never worked as anticipated.

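The mistake was that I was applying the NGram filter at both index and query time. NGram should only be applied at index time; otherwise the query term is itself split into fragments, and each fragment matches loosely against the index. After removing the NGram filter from the query analyzer, I got the required results.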

See the config below. Notice that I have added solr.RemoveDuplicatesTokenFilterFactory to remove possible duplicates produced by the NGram filters.

<fieldType name="phonetics" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <!-- Index time: the n-gram filters are applied here only -->
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="1000"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="1000"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="0"
                generateWordParts="1" stemEnglishPossessive="0" generateNumberParts="0"
                catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"/>
        <!-- Drop duplicate tokens produced by the n-gram filters above -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <!-- Query time: same chain, but without the n-gram filters -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="0"
                generateWordParts="1" stemEnglishPossessive="0" generateNumberParts="0"
                catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"/>
    </analyzer>
</fieldType>
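
For comparison, here is a stripped-down sketch of the same index-only idea, assuming prefix matching is all that is needed. The type name 'name_prefix' and the gram sizes are illustrative choices for this example and are not taken from my working config; I have not tested this variant.

```xml
<fieldType name="name_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- "modenine" is indexed as mo, mod, mode, moden, ... modenine -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- No n-gram filter here: the query term "mode" is looked up as-is
             and matches the indexed gram "mode". -->
    </analyzer>
</fieldType>
```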

Cheers

Babajide

1
votes

You aren't getting wildcard matches because you aren't performing a wildcard search. name:mode* would match "modenine", though it would not match phonetically, since wildcard/prefix searches are not analyzed. That makes sense, because phonetic algorithms assume they are working with a complete word.

If you want to search on both, you should use a query like: name:mode name:mode*.