
I need to remove certain keywords and then remove all spaces from the string during index analysis. I am using these filters:

'analysis' => array(
    'filter' => array(
        'whitespace_remove' => array(
            'type' => 'pattern_replace',
            'pattern' => ' ',
            'replacement' => ''
        ),
        'my_stop' => array(
            'type' => 'stop',
            'stopwords' => array('bad', 'horrible', 'useless')
        ),
        'edge' => array(
            'type' => 'edge_ngram',
            'min_gram' => '1',
            'max_gram' => '5'
        )
    ),

and the analyzer with

'keyword_space_ngram' => array(
    'type' => 'custom',
    'tokenizer' => 'keyword',
    'filter' => array(
        'lowercase',
        'my_stop',
        'whitespace_remove',
        'edge'
    )
)

How do I ensure that the filters are applied in this order, that is: convert to lowercase, remove the keywords, remove the spaces, and then perform the ngram analysis?

I thought the order in which they are defined is the order in which they are applied. Isn't it? Do you have issues with that config? Can you provide more details? - Andrei Stefan
I understood that the problem is the keyword tokenizer, which does not split on spaces at all. The filters are applied to the tokenized words, so my stop words inside a bigger string won't be recognized. If I use the standard tokenizer, spaces will be removed only within the individual tokens. How do I ensure that tokenization happens after my filtering? Or is there any other workaround? - Vinu K S
Figured it out, by using character filters. The order of character filters, then the tokenizer, then the regular token filters cannot be changed. - Vinu K S
Indeed ;-). Initially, I thought you were referring to the order of the filters themselves. - Andrei Stefan

1 Answer


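You can remove the stop words and the whitespace with custom char filters (char_filter) at index time. Char filters run on the raw string before the tokenizer, so the stop words are matched even inside a longer string: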

  {
    "analysis": {
      "char_filter": {
        "whitespace_remove": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        },
        "custom_stop_words_char_filter": {
          "type": "mapping",
          "mappings": [
            "bad =>  ",
            "horrible =>  ",
            "useless =>  "
          ]
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "asciifolding"],
          "char_filter": ["custom_stop_words_char_filter", "whitespace_remove"]
        }
      }
    }
  }
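  • This will transform "bad angry man" to "angryman", for example.

  • To add your edge_ngram filter, define edge under filter (as in the question) and append it to the end of the analyzer's filter array.

  • Note: char filters run before the lowercase token filter, so your stop words will only be substituted if they are lowercase in the input.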
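
For reference, here is a sketch of how the settings might look with the edge_ngram filter from the question appended. It reuses the char filters and analyzer from this answer and the edge filter definition from the question; nothing else is new, and it has not been tested against a particular Elasticsearch version.

```json
{
  "analysis": {
    "char_filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": "\\s+",
        "replacement": ""
      },
      "custom_stop_words_char_filter": {
        "type": "mapping",
        "mappings": [
          "bad =>  ",
          "horrible =>  ",
          "useless =>  "
        ]
      }
    },
    "filter": {
      "edge": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 5
      }
    },
    "analyzer": {
      "custom_analyzer": {
        "type": "custom",
        "char_filter": ["custom_stop_words_char_filter", "whitespace_remove"],
        "tokenizer": "whitespace",
        "filter": ["lowercase", "asciifolding", "edge"]
      }
    }
  }
}
```

With these settings, "bad angry man" should first be reduced to "angryman" by the two char filters, come out of the tokenizer as a single token, and then be expanded by the edge filter into a, an, ang, angr, angry. The order in which the keys appear inside the analyzer does not matter: char_filter always runs first, then the tokenizer, then the entries of filter from left to right.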