I have a Solr 4.7.0 instance with 200,000 documents in the index (one document per file on a filesystem), used by several users. Documents are identified by keywords, which are indexed and stored in a single field called "signature_1". At index time, I replace all punctuation with white space (thanks to a ScriptUpdateProcessor), so my keywords are separated by white space in both the indexed and the stored parts of the field signature_1 (fieldType signature).
<fieldType name="signature" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1000" consumeAllTokens="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers_secteurs.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
</fieldType>
I would like the same behaviour at query time: if somebody searches for
A-B-C
I would like Solr to run the following search (with an OR operator, dismax):
A B C
So basically, I simply want Solr to search among the documents' keywords, with punctuation removed.
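To make the expected behaviour concrete, here is a minimal sketch in Python of the normalization I have in mind. It is not Solr code and says nothing about how Dismax works internally; the function name `normalize_query` is hypothetical, and the character class is simply copied from the charFilter pattern in the fieldType above.

```python
import re

# Same character class as the PatternReplaceCharFilterFactory pattern above:
# anything that is not a letter (including the listed accented ones) or a digit.
NON_KEYWORD_CHARS = re.compile(r"[^a-zA-Z0-9éèàùêâûôîäëöüï]")


def normalize_query(raw: str) -> str:
    """Replace each punctuation character with a space, then collapse whitespace.

    Hypothetical helper, only meant to illustrate the desired keyword list.
    """
    spaced = NON_KEYWORD_CHARS.sub(" ", raw)
    return " ".join(spaced.split())


# Both user inputs should end up as the same three keywords.
for raw in ("A-B-C", "A B-C"):
    print(raw, "->", normalize_query(raw))
```

With this normalization, both "A-B-C" and "A B-C" reduce to the same keyword list, which is why I expect both queries to return results in the same order.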
The A-B-C example above works well, but in some cases it does not work this way. With a query of
A B-C
Dismax splits the query into
(+(DisjunctionMaxQuery((signature_1:a)) DisjunctionMaxQuery((signature_1:"b c"))) ())/no_coord
and this messes up the relevancy (i.e. the order) of my results. I tried using autoGeneratePhraseQueries="True" but without effect.
So I would like Dismax either to always split on whitespace AND punctuation, or never to split at all (the results would be the same). Any idea how I can achieve this (without having to write my own Java Dismax class)?
The following posts are related to my problem: