
I am looking for a way to search company names with keyword tokenizing but without stopwords.

For ex : The indexed company name is "Hansel und Gretel Gmbh."

Here "und" and "Gmbh" are stop words for the company name.

If the search term is "Hansel Gretel", that document should be found. If the search term is "Hansel", no document should be found. And if the search term is "hansel gmbh", no document should be found either.

I have tried to combine the keyword tokenizer with a stopwords filter in a custom analyzer, but it didn't work (as expected, I guess).
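
Roughly, the custom analyzer I tried looked like this (a sketch from memory; the index name and stopword list are just what I remember using):

curl -XPUT "http://localhost:9200/companies" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "company_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "company_stopwords"]
        }
      },
      "filter": {
        "company_stopwords": {
          "type": "stop",
          "stopwords": ["und", "gmbh"]
        }
      }
    }
  }
}'

I assume it fails because the keyword tokenizer emits the whole name as a single token, so the stop filter never sees "und" or "gmbh" as separate words.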

I have also tried to use the common terms query, but "Hansel" started to hit (again, as expected).

Thanks in advance.

Please be a bit more specific when asking a question: what have you tried so far, with a code example? (I downvoted because there is no code.) What do you expect? What error do you get? For help, take a look at "How to Ask". – Hille
Please provide your settings, document and query. – Ivan Mamontov
@Ivan, I could not figure out which settings and which query I should use; actually, that is the question. For the document, just consider a String field. (It has to be analyzed somehow for the stopwords, I guess, but the question is how to analyze it...) – user3088282
@user3088282 then please write it in your question with an edit. – Hille
@Hille, I think I have expressed what I expect, and I haven't got any error. I have tried to combine the keyword filter with stopwords in a custom analyzer, but it didn't work. The question is not a "why" question so much as a "how" question, because I could not figure out how to achieve the purpose. If you think that what I am trying to do is not clear, then let me think of another way to express it. – user3088282

1 Answer


There are two ways: a bad one and an ugly one. The first uses regular expressions to remove stop words and trim spaces. It has a lot of drawbacks:

  • you have to handle whitespace tokenization (regexp \s+) and special-symbol (.,;) removal on your own
  • no highlighting is supported, since the keyword tokenizer does not support it
  • case sensitivity is also a problem
  • normalizers (analyzers for keyword fields) are an experimental feature: poor support, few features

Here is a step-by-step example:

curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "char_filter": ["stopword_char_filter", "trim_char_filter"],
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "stopword_char_filter": {
          "type": "pattern_replace",
          "pattern": "( ?und ?| ?gmbh ?)",
          "replacement": " "
        },
        "trim_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\s+)$",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "name": {
          "type": "keyword",
          "normalizer": "custom_normalizer"
        }
      }
    }
  }
}'

Now we can check how our normalizer works (please note that _analyze requests with a normalizer are supported only in ES 6.x):

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "normalizer": "custom_normalizer",
  "text": "hansel und gretel gmbh"
}'
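
If everything is wired up correctly, the response should contain a single token with the stop words stripped out, roughly like this (offsets may vary):

{
  "tokens": [
    {
      "token": "hansel gretel",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}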

Now we are ready to index our document:

curl -XPUT "http://localhost:9200/test/file/1" -H 'Content-Type: application/json' -d'
{
  "name": "hansel und gretel gmbh"
}'

And the last step is search:

curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match" : {
            "name" : {
                "query" : "hansel gretel"
            }
        }
    }
}'
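
As a quick sanity check for the original requirement, the same query with just "hansel" (or "hansel gmbh") should return no hits, because the whole normalized query string has to equal the whole normalized keyword value:

curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match" : {
            "name" : {
                "query" : "hansel"
            }
        }
    }
}'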

Another approach is:

  • create a standard text analyzer with a stop words filter
  • use the _analyze API to filter out all stop words and special symbols
  • concatenate the tokens manually
  • send the term to ES as a keyword

Here is a step-by-step example:

curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_stopwords"]
        }
      }, "filter": {
        "custom_stopwords": {
          "type": "stop",
          "stopwords": ["und", "gmbh"]
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}' 

Now we are ready to analyze our text:

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}'

with the following result:

{
  "tokens": [
    {
      "token": "hansel",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gretel",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

The last step is concatenating the tokens: hansel + gretel. The only drawback is that the analysis step requires custom code on the client side.
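
A minimal sketch of that manual step, assuming the test index above and jq installed on the client: run the raw name through the _analyze API and join the surviving tokens with a space.

# analyze the raw company name and concatenate the remaining tokens
curl -s -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}' | jq -r '[.tokens[].token] | join(" ")'
# prints: hansel gretel

The resulting string ("hansel gretel") is what you would then index into and query against a plain keyword field.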