I get Solr results for the following queries:
- Sports
- World Health Organisation
- percent
but I don't get any results for these:
- Sport (UK)
- World Health Organisat
- 1-percent
All of these are in the text field, which definitely contains these phrases, and I use an n-gram filter at index time, so the partial terms do exist in the index. The Analysis tab of the Solr admin UI shows exactly what I expect, but I am not getting the expected results in my Java output.
My solrj code is as below:
query.setQuery("full_text:\"World Health Organisation\"");
Also, I have to wrap the query in escaped quotes (\"...\"): if I remove them, I always get errors in my front end, and half of the results I otherwise get no longer turn up.
Can someone help with what I might be missing?
Much thanks!
Edit: the definition of full_text in schema.xml:
<field name="full_text" type="text_en" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="full_text"/>
<copyField source="content" dest="full_text"/>
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Solution: I figured out what the problem was. For "Sport (UK)" and "1-percent", the tokeniser I was using removed all special characters, so I changed my tokeniser. As for "World Health Organisation", the cause was the stemmer: at index time it changed "Organisation" to "Organis", while a query term like "Organisat" was kept as it is. The n-grams in the index were therefore built from "organis", and none of them can match "organisat", hence I did not get results. Since I am already using an n-gram filter, I removed the stemmer.
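To make the two causes concrete, here is a minimal, self-contained Java sketch. It does not use any Lucene or Solr classes: the helpers `ngrams`, `splitOnNonAlphanumerics` and `splitOnWhitespace` are hypothetical stand-ins that only roughly approximate an n-gram filter and two tokenisation strategies, and the stemmed form "organis" is hard-coded from what is described above rather than computed. The whitespace split is just one assumed alternative tokeniser, not necessarily the one to pick.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AnalysisMismatchSketch {

    // Simplified stand-in for an n-gram filter (assumption, not Lucene's code):
    // every substring of the token with a length between min and max, inclusive.
    static Set<String> ngrams(String token, int min, int max) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = min; len <= max && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    // Rough approximation of a tokeniser that discards punctuation:
    // split on anything that is not a letter or a digit.
    static List<String> splitOnNonAlphanumerics(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Assumed alternative: split on whitespace only, so "(UK)" and
    // "1-percent" survive as single tokens.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stemmer case: the index holds n-grams of the stemmed token "organis",
        // while the query term "organisat" is left unchanged by the stemmer.
        Set<String> stemmedGrams = ngrams("organis", 3, 20);
        Set<String> unstemmedGrams = ngrams("organisation", 3, 20);
        System.out.println("stemmed index matches 'organisat':   "
                + stemmedGrams.contains("organisat"));
        System.out.println("unstemmed index matches 'organisat': "
                + unstemmedGrams.contains("organisat"));

        // Tokeniser case: punctuation is lost with the first strategy.
        System.out.println(splitOnNonAlphanumerics("Sport (UK)"));
        System.out.println(splitOnWhitespace("Sport (UK)"));
        System.out.println(splitOnNonAlphanumerics("1-percent"));
        System.out.println(splitOnWhitespace("1-percent"));
    }
}
```

The longest n-gram of a seven-letter token has seven letters, so a nine-letter query term can never match it. Without the stemmer, the n-grams come from the full word and the truncated term is among them.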
Hope this helps others in the long run. :)