Solr dismax behaviour - punctuation and white space splitting

Question

I have a Solr 4.7.0 instance with 200,000 documents in the index (one document per file on a filesystem), used by several users. Documents are identified by keywords, which are indexed and stored in a single field called "signature_1". At index time, I replace every punctuation character with white space (using a ScriptUpdateProcessor), so the keywords are separated by white space in both the indexed and the stored content of signature_1 (field type "signature").

<fieldType name="signature" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1000" consumeAllTokens="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers_secteurs.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
</fieldType>

I would like the same behaviour at query time: if somebody searches for

A-B-C

I would like Solr to run the following search (with an OR operator, dismax):

A B C

So basically, I simply want Solr to search among the documents' keywords, with punctuation removed.

The example above works well, but in some cases it does not behave this way. With a query of

A B-C

Dismax parses the query into

(+(DisjunctionMaxQuery((signature_1:a)) DisjunctionMaxQuery((signature_1:"b c"))) ())/no_coord

and this messes up the relevancy (i.e. the order) of my results. I tried setting autoGeneratePhraseQueries="True", but it had no effect.

So I would like Dismax either to always split on both whitespace and punctuation, or never to split at all (the results would be the same). Any idea how I can achieve this (without having to write my own Java Dismax class)?


femtoRgon · Accepted Answer · 2014-09-22T22:59:52

I'm not really clear on whether you want A B-C to be a phrase query ("A B C") or three separate term queries (A B C), but:

If you want it to be a phrase query, just wrap the whole thing in quotes: "A B-C"

If you want each term to be searched separately, just remove the punctuation yourself, leaving A B C.
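For the second option, a minimal client-side sketch in Python could look like the following. It assumes the query string passes through your own application code before being sent to Solr; the function name normalize_query is hypothetical, and the character class is copied from the PatternReplaceCharFilterFactory pattern in the question's schema.

```python
import re

# Same character class as the charFilter pattern in the schema: everything
# that is not an ASCII letter, a digit, or one of the listed accented
# letters is treated as punctuation.
_NON_KEYWORD = re.compile(r"[^a-zA-Z0-9éèàùêâûôîäëöüï]")


def normalize_query(raw: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace.

    Hypothetical helper: call it on the user's input before sending the
    result to Solr as the q parameter.
    """
    spaced = _NON_KEYWORD.sub(" ", raw)
    return " ".join(spaced.split())


print(normalize_query("A B-C"))  # A B C
```

Note that this also strips query syntax characters such as quotes, + and -, so it only fits if users are not expected to type phrase or required/prohibited clauses themselves.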

The query parser generally separates query clauses at spaces, not at punctuation. This has nothing to do with analysis; it's just query parser syntax. So, for A B-C, you end up with two query clauses, A and B-C. When analysis kicks in, B-C is split into two terms, so the query parser makes that clause a phrase query instead of a term query, and the end result looks something like A "B C".
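The two steps described above can be illustrated with a toy model in Python. This is not Solr code: toy_parse is a hypothetical function, and the regular expression is only a rough stand-in for the field's real analysis chain (no stemming, synonyms, or accent folding).

```python
import re


def toy_parse(query: str) -> list[str]:
    """Toy model of the behaviour described above, not the real parser."""
    clauses = []
    # Step 1: the query parser splits clauses on whitespace only.
    for clause in query.split():
        # Step 2: analysis (simplified here) splits each clause into terms.
        terms = re.findall(r"[a-z0-9]+", clause.lower())
        if len(terms) > 1:
            # Several terms from a single clause become a phrase query.
            clauses.append('"' + " ".join(terms) + '"')
        elif terms:
            clauses.append(terms[0])
    return clauses


print(toy_parse("A B-C"))  # ['a', '"b c"']
print(toy_parse("A B C"))  # ['a', 'b', 'c']
```

The first result mirrors the parsed query in the question (a term query on a plus a phrase query on "b c"); once the punctuation is removed beforehand, every clause stays a single term.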
