2 votes

I have a problem that I have been trying to solve for quite some time. I am not a Solr expert; I am still learning it.

I have a special type of ID in my system that has to be searchable by users. The problem is that those IDs contain some Solr special characters. Those IDs are stored together with other search terms in the terms_txt field.

Some ID examples: 292/2017 and 1.2.61-962-37/2017
I will refer to the first one as the 'simple one' and to the second as the 'complex one'.

From what I have read on the internet, this kind of search should be possible with a phrase search. So if we put quotation marks around the ID, it should work. Unfortunately, that is not the case. I will post my Solr 4.0 schema and an example of my query here, hoping that you can spot what is wrong. If phrase search is the answer to my problem, then something must be wrong with either the schema or my query (code).

In my example I am searching for "292/2017" as a phrase. Only one entry in my index contains this phrase, because this combination of characters is unique (it is a kind of ID, but we insert it into the terms_txt field with all the other terms).

This is a query executed via the Solr admin UI. It finds a lot of results, but there should be only 1. It seems that Solr handles the '/' character as a space and ignores terms shorter than 3 characters (ignoring terms shorter than 3 is what we want, but not in a phrase search):

INFO: [collection1] webapp=/solr-example path=/select params={q=terms_txt:"44/2017"&wt=xml} hits=31343 status=0 QTime=6 

So basically, in this example, Solr has found all records containing the term 2017, which is bad...

This is a query executed within the application logic. It is more complex, but the problem is the same:

INFO: [collection1] webapp=/solr-example path=/select params={mm=100%25&json.nl=flat&fl=id&start=0&sort=date_in_i+desc&fq=type_s:2&fq=date_in_i:[20161201+TO+*]&fq=date_in_i:[*+TO+20171011]&fq=subtype_s:(2+4+6+8)&fq=terms_txt:"\"10/2017\""&fq=language_is:0&rows=10&bq=&q=\"10\/2017\"&tie=0.1&defType=edismax&omitHeader=true&qf=terms_txt&wt=json} hits=978 status=0 QTime=2

This is what terms_txt entries look like in the index:

<arr name="terms_txt">
    <str>Some string blah blah 292/2017 - more of terms, blah blah</str>
    <str>Something else, blah blah</str>
</arr>

This is my Solr schema field type configuration for the terms_txt field (the fields are dynamic):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>          
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|\s)([^\-\_&amp;\s]+([\-\_&amp;]+[^\-\_&amp;\s]*)+)(?=(\s|$))" replacement="$1MжџљМ$2 $2" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&amp;]+" replacement="MжџљМ$1" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&amp;]+" replacement="MжџљМ$1" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\bMжџљМ([^\s]*?)\b[\-_&amp;]+" replacement="MжџљМ$1" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="MжџљМ" replacement="" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&amp;(\w)" replacement="$1and$2" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="3" max="99"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\b[\-_]+\b" replacement="" />
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w)&amp;(\w)" replacement="$1and$2" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="3" max="99"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Does anyone have a clue how I can make special characters like . - / searchable? Can you spot a flaw in my example or suggest a better solution?

You should start by looking at what the Analysis page tells you about your content - my guess is that the StandardTokenizer will remove a lot of special characters when tokenizing (and your PatternReplaces will remove content as well; not sure why you need the same replacement so many times, but you probably have a reason). The WhitespaceTokenizer might be better suited for a field meant for matching special characters. When writing the query, escape the special characters with `\`. - MatsLindh
@MatsLindh escaping special characters does not help - offline
It has to be done together with the other changes I proposed. - MatsLindh
The trick is to use different fields with different tokenizers, then prioritize hits in those fields based on a weight. Instead of trying to make one field fit all your query needs, make multiple fields - one for each definition - and query across all of them. - MatsLindh
If you only want to query the field with the whitespace tokenizer, you give only that field name in qf. If you want to search multiple fields, you give the fields and their weights: qf=original whitespace_tokenized^3 will weigh a hit in whitespace_tokenized three times higher than a hit in original. - MatsLindh

1 Answer

2 votes

You should start by looking at what the analysis page for your content tells you - my guess is that the StandardTokenizer will remove a lot of special characters when tokenizing (and your PatternReplaces might remove content as well).
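
For example, with your index chain "44/2017" is first split on the "/" by the StandardTokenizer, and the LengthFilterFactory (min="3") then drops the 44 token entirely. A rough sketch of what the Analysis page should show (not its exact output):

input:               44/2017
StandardTokenizer:   [44] [2017]
LengthFilter(3, 99): [2017]

The phrase query that actually reaches the index therefore contains nothing but the term 2017 - which matches your 31343 hits.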

The Whitespace Tokenizer is better suited for a field where matching special characters is important, since it'll only break on and remove whitespace.
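
A minimal sketch of such a field type - the name text_ws is just an example, not something from your schema:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- break on whitespace only, so 292/2017 and 1.2.61-962-37/2017 survive as single tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>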

Define different fields with different tokenizers, then prioritize hits in those fields based on a weight. Instead of trying to make one field fit all your query needs, make multiple fields - one for each definition - and query across all of them. You can adjust the weights by using qf together with the (e)dismax handlers. These handlers also allow you to boost phrase matches over two- and three-token shingles (edismax's pf2 and pf3 parameters).
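
With a hypothetical whitespace-tokenized copy named terms_ws_txt (an assumed name, using the text_ws field type sketched above), the relevant query parameters could look something like this:

defType=edismax
q="292/2017"
qf=terms_txt terms_ws_txt^3
pf2=terms_txt^2
pf3=terms_txt^2

A hit in terms_ws_txt then counts three times as much as a hit in terms_txt, and pf2/pf3 additionally reward documents where adjacent query tokens occur next to each other in terms_txt.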

Use one or more copyField instructions to get your content from one field to the other fields, so you don't have to change your indexing code when you tweak things on the Solr side.
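
Since your fields are dynamic, a wildcard copyField can route every *_txt field into a whitespace-tokenized sibling. A sketch, assuming the text_ws field type and the *_ws_txt naming from above:

<dynamicField name="*_ws_txt" type="text_ws" indexed="true" stored="false"/>
<copyField source="*_txt" dest="*_ws_txt"/>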

If you append debugQuery=true to your query string, you can also see how Solr / Lucene computes the score for each document and what contributes to its ranking, so you can tweak scoring values and see exactly how the final score changes.
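
For example (host and port are assumed from a default example setup; the /solr-example path is taken from your logs):

http://localhost:8983/solr-example/select?q=terms_txt:"292/2017"&debugQuery=true&wt=xml

The debug section shows the parsed query and a per-document score breakdown.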

When writing the query, escape any special characters with \.
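
For the complex ID that means something like this (terms_ws_txt again being the assumed whitespace-tokenized field; - and / are the characters that matter to the query parser):

q=terms_ws_txt:1.2.61\-962\-37\/2017

Alternatively, keep the quotes - inside a phrase only the field's analyzer is applied, so the special characters survive as long as the tokenizer keeps them:

q=terms_ws_txt:"1.2.61-962-37/2017"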