0 votes

I needed partial search on my website. Initially I used EdgeNgramField directly, but it didn't work as expected, so I moved to a custom search engine with custom analyzers. I am using django-haystack.

'settings': {
       "analysis": {
           "analyzer": {
               "ngram_analyzer": {
                   "type": "custom",
                   "tokenizer": "lowercase",
                   "filter": ["haystack_ngram"]
               },
               "edgengram_analyzer": {
                   "type": "custom",
                   "tokenizer": "lowercase",
                   "filter": ["haystack_edgengram"]
               },
               "suggest_analyzer": {
                   "type":"custom",
                   "tokenizer":"standard",
                   "filter":[
                       "standard",
                       "lowercase",
                       "asciifolding"
                   ]
               },
           },
           "tokenizer": {
               "haystack_ngram_tokenizer": {
                   "type": "nGram",
                   "min_gram": 3,
                   "max_gram": 15,
               },
               "haystack_edgengram_tokenizer": {
                   "type": "edgeNGram",
                   "min_gram": 2,
                   "max_gram": 15,
                   "side": "front"
               }
           },
           "filter": {
               "haystack_ngram": {
                   "type": "nGram",
                   "min_gram": 3,
                   "max_gram": 15
               },
               "haystack_edgengram": {
                   "type": "edgeNGram",
                   "min_gram": 2,
                   "max_gram": 15
               }
           }
       }
   } 

I used edgengram_analyzer for indexing and suggest_analyzer for searching. This worked to some extent, but it doesn't work for numbers: for example, when 30 is entered it doesn't find 303, and the same happens with terms that mix letters and digits. So I searched various sites.
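One way to see why the digits get lost is to ask Elasticsearch's _analyze API which tokens edgengram_analyzer actually produces for a sample string. A minimal sketch with the pre-8.x elasticsearch Python client (the local URL and the index name "haystack" are assumptions; use your own index name):

from elasticsearch import Elasticsearch

# Assumed local node; point this at your own cluster.
es = Elasticsearch("http://localhost:9200")

# Run the index-time analyzer from the settings above over sample text
# and print the tokens it emits.
result = es.indices.analyze(
    index="haystack",
    body={"analyzer": "edgengram_analyzer", "text": "30 and 303"},
)
print([t["token"] for t in result["tokens"]])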

They suggested using the standard or whitespace tokenizer together with the haystack_edgengram filter. But it didn't work at all; leaving numbers aside, partial search didn't even work for letters. The settings after that suggestion:

'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitepsace",
                    "filter": ["haystack_edgengram"]
                },
                "suggest_analyzer": {
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "standard",
                        "lowercase",
                        "asciifolding"
                    ]
                },
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    } 

Does anything other than the lowercase tokenizer work with django-haystack, or is the haystack_edgengram filter just not working for me? To my knowledge it should work like this: given 2 Lazy Dog as the supplied text, the whitespace tokenizer should produce the tokens [2, Lazy, Dog], and applying the haystack_edgengram filter should then generate [2, la, laz, lazy, do, dog]. It's not working like this. Did I do something wrong?

My requirement is, for example, that for the text 2 Lazy Dog, typing 2 Laz should match it.
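To check that expectation without reindexing, the _analyze API (on Elasticsearch 5.x and later) also accepts an ad-hoc tokenizer and inline filter definitions. A sketch using the same gram sizes as the settings above (the client setup is again an assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# Tokenize "2 Lazy Dog" with the whitespace tokenizer, then apply an edge
# n-gram filter defined inline with the same min_gram/max_gram as
# haystack_edgengram, and print what actually comes out.
result = es.indices.analyze(
    body={
        "tokenizer": "whitespace",
        "filter": [
            "lowercase",
            {"type": "edgeNGram", "min_gram": 2, "max_gram": 15},
        ],
        "text": "2 Lazy Dog",
    }
)
print([t["token"] for t in result["tokens"]])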

Edited:

My assumption is that the lowercase tokenizer worked properly, but for the text above it omits 2 and creates the tokens [lazy, dog]. Why can't the standard or whitespace tokenizer work?

2 Answers

2 votes

In the ngram filters you define min_gram, which is the minimum length of the generated tokens. In your case '2' has length 1, so it is ignored by the ngram filters.

The easiest way to fix this is to change min_gram to 1. A slightly more involved approach is to combine a standard analyzer for matching whole keywords (useful for shorter terms) with an ngram analyzer for partial matching of longer terms, perhaps tied together with some bool queries.

You can also change the ngrams to start from 1 character but require at least 3 letters in your search box before sending the query to Elasticsearch.
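A rough sketch of the bool-query idea, assuming two hypothetical fields: text indexed with a standard analyzer and text.edgengram indexed with the edge n-gram analyzer (the field names are illustrative, not something haystack generates for you):

# "should" clauses: the standard-analyzed field is boosted so whole-word
# matches rank first, while the edge n-gram field supplies partial matches.
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"text": {"query": "2 Laz", "boost": 2}}},
                {"match": {"text.edgengram": "2 Laz"}},
            ],
            "minimum_should_match": 1,
        }
    }
}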

-1 votes

Found the answer myself, building on @jgr's suggestion:

ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["haystack_edgengram","lowercase"]
                },
                "suggest_analyzer": {
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "lowercase"
                    ]
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15
                }
            }
        }
    }
}

ELASTICSEARCH_DEFAULT_ANALYZER = "suggest_analyzer"
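Note that changed analyzer settings only take effect once the index is recreated; with django-haystack that usually means running the rebuild_index management command after updating the settings.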