If I define my index with this analyzer (C#):

settings = new
{
    index = new
    {
        number_of_shards = 1,
        number_of_replicas = 1,

        analysis = new
        {
            analyzer = new
            {
                analyzer_standard_with_html_strip = new
                {
                    type = "standard",
                    char_filter = new string[] { "html_strip" },
                    stopwords = "_english_"
                }
            }
        }
    }
}

What does the type field do? Does it base the analyzer on the standard analyzer? If I don't have the type line at all it seems to work. This, from https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html, seems to suggest you don't need it:

In this example, we configure the html_strip character filter to leave <b> tags in place:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

There, the analyzer has no type specified. Shouldn't it be "custom"?

So, what does the type field do when you're defining an analyzer? What is the difference between

"my_analyzer": {
  "type": "standard",
  "tokenizer": "keyword",
  "char_filter": ["my_char_filter"]
}

and

"my_analyzer": {
  "type": "custom",
  "tokenizer": "keyword",
  "char_filter": ["my_char_filter"]
}

and

"my_analyzer": {
  "tokenizer": "keyword",
  "char_filter": ["my_char_filter"]
}

?

1 Answer

When you define a custom analyzer, you are supposed to specify "type": "custom". You can also leave out the type setting, in which case the analyzer is still treated as a custom one, but that is not good practice because it doesn't convey what you're doing.

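You can also specify "type": "standard", but only when you are merely configuring the built-in standard analyzer through its own parameters (such as max_token_length and stopwords). For instance, here we're configuring a standard analyzer with English stop words, and it is not a custom one: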

"my_english_analyzer": {
  "type": "standard",
  "max_token_length": 5,
  "stopwords": "_english_"
}
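
To see what a configured standard analyzer like this produces, one option is to create a throwaway index and run text through the _analyze API. This is only a minimal sketch: the index name my_standard_index and the sample text are placeholders chosen for illustration.

```console
PUT my_standard_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_standard_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The <b>quick</b> brown fox"
}
```

Since no character filter is part of this definition, the markup in the sample text is not stripped before tokenization, which is why a standard-type analyzer is not enough for the HTML case.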

So your analyzer analyzer_standard_with_html_strip should be of type custom. If you want to reuse the standard analyzer in your custom analyzer but add a character filter, you can redefine the standard analyzer as a custom one, i.e. use the same tokenizer and token filters and add the character filter. Note that the "standard" token filter was a no-op and was removed in Elasticsearch 7.0, so only "lowercase" is listed here:

"analyzer_standard_with_html_strip": {
  "type": "custom",
  "tokenizer": "standard",              <--- like standard
  "filter": [ "lowercase" ],            <--- like standard
  "char_filter": ["my_char_filter"]     <--- this is custom
}
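
Applied to the analyzer from the question, a complete index definition that also carries over the English stop words could look roughly like the sketch below. The index name my_html_index and the filter name english_stop are placeholders chosen for illustration, and html_strip is the built-in character filter used with its default options instead of a configured my_char_filter.

```console
PUT my_html_index
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "analyzer_standard_with_html_strip": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop"]
        }
      }
    }
  }
}

POST my_html_index/_analyze
{
  "analyzer": "analyzer_standard_with_html_strip",
  "text": "<p>The <b>quick</b> brown fox</p>"
}
```

The stop filter is placed after lowercase so that capitalized stop words such as "The" are matched as well. The _analyze call is there only to verify that the tags are stripped and the stop words dropped before you index real documents.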