I am looking for a way to make Elasticsearch search the data with multiple analyzers: an ngram analyzer and one or more language analyzers.

One possible solution is to use multi-fields and explicitly declare which analyzer to use for each sub-field.

For example, with the following mappings:

    "mappings": {
      "my_entity": {
        "properties": {
          "my_field": {
            "type": "text",
            "fields": {
              "ngram": {
                "type": "text",
                "analyzer": "ngram_analyzer"
              },
              "spanish": {
                "type": "text",
                "analyzer": "spanish"
              },
              "english": {
                "type": "text",
                "analyzer": "english"
              }
            }
          }
        }
      }
    }

The problem is that I have to explicitly list every field and its analyzers in each search query. It also does not allow searching "_all" with multiple analyzers.
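
To illustrate what "explicitly list every field" means here, a query against the mapping above might look like the sketch below. The index name `my_index` and the query text are assumptions for illustration; the sub-field names come from the mapping. `multi_match` also accepts wildcards in field names, so `"my_field.*"` could stand in for the three entries, assuming every sub-field should be searched.

    GET /my_index/_search
    {
      "query": {
        "multi_match": {
          "query": "llamo",
          "type": "most_fields",
          "fields": [
            "my_field.ngram",
            "my_field.spanish",
            "my_field.english"
          ]
        }
      }
    }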

Is there a way to make an "_all" query use multiple analyzers, something like "_all.ngram" and "_all.spanish", without using copy_to to duplicate the data?

Is it possible to combine the ngram analyzer with a Spanish (or any other language) analyzer into a single custom analyzer? I tested the following settings, but they did not work:

    PUT /ngrams_index
    {
      "settings": {
        "number_of_shards": 1,
        "analysis": {
          "tokenizer": {
            "ngram_tokenizer": {
              "type": "nGram",
              "min_gram": 3,
              "max_gram": 3
            }
          },
          "filter": {
            "ngram_filter": {
              "type": "nGram",
              "min_gram": 3,
              "max_gram": 3
            },
            "spanish_stop": {
              "type": "stop",
              "stopwords": "_spanish_"
            },
            "spanish_keywords": {
              "type": "keyword_marker",
              "keywords": ["ejemplo"]
            },
            "spanish_stemmer": {
              "type": "stemmer",
              "language": "light_spanish"
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "type": "custom",
              "tokenizer": "ngram_tokenizer",
              "filter": [
                "lowercase",
                "spanish_stop",
                "spanish_keywords",
                "spanish_stemmer"
              ]
            }
          }
        }
      },
      "mappings": {
        "my_entity": {
          "_all": {
            "enabled": true,
            "analyzer": "ngram_analyzer"
          },
          "properties": {
            "my_field": {
              "type": "text",
              "fields": {
                "analyzer1": {
                  "type": "text",
                  "analyzer": "ngram_analyzer"
                },
                "analyzer2": {
                  "type": "text",
                  "analyzer": "spanish"
                },
                "analyzer3": {
                  "type": "text",
                  "analyzer": "english"
                }
              }
            }
          }
        }
      }
    }



    GET /ngrams_index/_analyze
    {
      "field": "_all",   
      "text": "Hola, me llamo Juan."
    }

returns only ngram tokens, without any Spanish analysis, whereas

    GET /ngrams_index/_analyze
    {
      "field": "my_field.analyzer2",   
      "text": "Hola, me llamo Juan."
    }

analyzes the search string properly.

Is it possible to build a custom analyzer that combines Spanish analysis and ngrams?

1 Answer

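There is a way to create a custom ngram+language analyzer. Use the standard tokenizer so that the Spanish stop, keyword and stemmer filters see whole words, and apply the ngram token filter last. With an ngram tokenizer, the language filters only receive 3-character fragments, so they have almost nothing to match.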

    PUT /ngrams_index
    {
      "settings": {
        "number_of_shards": 1,
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "nGram",
              "min_gram": 3,
              "max_gram": 3
            },
            "spanish_stop": {
              "type": "stop",
              "stopwords": "_spanish_"
            },
            "spanish_keywords": {
              "type": "keyword_marker",
              "keywords": [
                "ejemplo"
              ]
            },
            "spanish_stemmer": {
              "type": "stemmer",
              "language": "light_spanish"
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "spanish_stop",
                "spanish_keywords",
                "spanish_stemmer",
                "ngram_filter"
              ]
            }
          }
        }
      },
      "mappings": {
        "my_entity": {
          "_all": {
            "enabled": true,
            "analyzer": "ngram_analyzer"
          },
          "properties": {
            "my_field": {
              "type": "text",
              "analyzer": "ngram_analyzer"
            }
          }
        }
      }
    }

    GET /ngrams_index/_analyze
    {
      "field": "my_field",
      "text": "Hola, me llamo Juan."
    }
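
To check that the filters really run in the intended order, the `_analyze` API can also be called with the analyzer name and `"explain": true`, which should report the tokens produced after the tokenizer and after each token filter. This is only a sketch, assuming the `ngrams_index` above has been created; the exact response layout depends on the Elasticsearch version, so no output is shown.

    GET /ngrams_index/_analyze
    {
      "analyzer": "ngram_analyzer",
      "explain": true,
      "text": "Hola, me llamo Juan."
    }

In the per-stage output, the stop and stemmer filters should be operating on whole words, with the 3-character grams appearing only at the final `ngram_filter` stage.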