3
votes

I am using a custom ngram analyzer which has an ngram tokenizer. I have also used a lowercase filter. The query works fine for searches without special characters, but when I search for certain symbols, it fails: because the tokenizer's token_chars is limited to letter and digit, Elasticsearch drops the symbols entirely. I know the whitespace tokenizer can help me solve the issue. How can I use two tokenizers in a single analyzer? Below is the mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer":"my_tokenizer",
          "filter":"lowercase"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter", 
            "digit"
          ]
        }
      }
    }
  },
    "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }

}
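
For example, running the mapping's analyzer through the _analyze API (the index name my_index is illustrative) shows the symbols being stripped:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo@bar"
}

This returns only the tokens foo and bar; the @ never reaches the index, because token_chars is restricted to letter and digit.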

Is there a way I could solve this issue?

3 Answers

3
votes

As per the Elasticsearch documentation,

An analyzer must have exactly one tokenizer.

However, you can have multiple analyzers defined in the settings, and you can configure a separate analyzer for each field.

If you want a single field to be analyzed with different analyzers, one option is to make that field a multi-field, as per this link:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace"
          "fields": {
            "ngram": { 
              "type":  "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}

If you configure it as above, your query needs to make use of both the title and title.ngram fields; with type most_fields, the scores from all matching fields are combined.

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "search @#$ whatever",
      "fields": [ 
        "title",
        "title.ngram"
      ],
      "type": "most_fields" 
    }
  }
}

As another option, here is what you can do:

  • Create two indexes (see the sketch after this list).
  • The first index has the field title with the analyzer my_analyzer.
  • The second index has the field title with the whitespace analyzer.
  • Create the same alias for both of them.
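
A minimal sketch of the two index definitions, assuming the placeholder names index_a and index_b and reusing the settings from the question:

PUT index_a
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": "lowercase"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": { "type": "text", "analyzer": "my_analyzer" }
      }
    }
  }
}

PUT index_b
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": { "type": "text", "analyzer": "whitespace" }
      }
    }
  }
}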

Then create the alias:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "index_a",
        "alias": "index"
      }
    },
    {
      "add": {
        "index": "index_b",
        "alias": "index"
      }
    }
  ]
}

So when you eventually write a query, it should point to this alias, which in turn queries both indexes.
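
For example, a search pointed at the alias (alias and field names taken from the snippets above):

GET index/_search
{
  "query": {
    "match": {
      "title": "search @#$ whatever"
    }
  }
}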

Hope this helps!

1
vote

1) You can try updating your token_chars as below:

      "token_chars":[
        "letter",
        "digit",
        "symbol",
        "punctuation"
      ]
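
You can check the effect with the _analyze API (illustrative input; my_index stands in for your index):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo@bar"
}

With symbol and punctuation added to token_chars, foo@bar is treated as one stream and yields the 3-grams foo, oo@, o@b, @ba, bar; with the original letter/digit setting, the @ acts as a break and only foo and bar come back.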

2) If that does not work, then try the analyzer below:

{
  "settings":{
    "analysis":{
      "filter":{
        "my_filter":{
          "type":"ngram",
          "min_gram":3,
          "max_gram":3,
          "token_chars":[
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      },
      "analyzer":{
        "my_analyzer":{
          "type":"custom",
          "tokenizer":"keyword",
          "filter":[
            "lowercase",
            "like_filter"
          ]
        }
      }
    }
  },
  "mappings":{
    "_doc":{
      "properties":{
        "title":{
          "type":"text",
          "analyzer":"my_analyzer"
        }
      }
    }
  }
}

You need to use the keyword tokenizer and then an ngram token filter in your analyzer. Note that token_chars is a tokenizer-only parameter; the ngram token filter takes just min_gram and max_gram. The keyword tokenizer keeps every character, symbols included, so nothing is lost before the ngram filter runs.
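
A quick check of that analyzer (illustrative input):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Foo@Bar"
}

The keyword tokenizer emits the whole input as a single token, lowercase turns it into foo@bar, and the ngram filter produces foo, oo@, o@b, @ba, bar. Keep in mind that because the keyword tokenizer never splits, the ngrams will also span whitespace in multi-word values.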

0
votes

If you want to use 2 tokenizers, you should have 2 analyzers, something like this:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer":"my_tokenizer",
          "filter":"lowercase"
        },
        "my_analyzer_2": {
          "tokenizer":"whitespace",
          "filter":"lowercase"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter", 
            "digit"
          ]
        }
      }
    }
  },
    "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }

}
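
To see how the two analyzers differ, you can compare them with _analyze (illustrative input):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elastic Search"
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer_2",
  "text": "Elastic Search"
}

The first returns lowercase 3-grams (ela, las, ast, sti, tic, sea, ear, arc, rch), while the second returns the whole lowercase words elastic and search.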

In general, you should also pay attention to where the analyzer is set in the mapping. Sometimes it is necessary to have an analyzer at both search time and index time; for example, you could index with the ngram analyzer but analyze queries with the whitespace one:

"mappings":{
    "_doc":{
      "properties":{
        "title":{
          "type":"text",
          "analyzer":"my_analyzer",
          "search_analyzer":"my_analyzer"
        }
      }
    }
  }