I've successfully implemented stemming for elasticsearch and thus when I search for "code" I hit upon "codes" and "coding" etc.

My problem arises when I try to use the "must_not" field in my queries. When I include "code" in the "must_not" field, it's fine and I still get my results as expected. But when I search for "codes", I don't get back any results, even though there are definitely documents that contain the word "codes".

My query is as follows:

// build one term filter per excluded word
var must_not = [];
for (var i = 0; i < exclude_words.length; i++)
{
  must_not.push({term:{text:exclude_words[i].toLowerCase()}});
}
query = {
  "filtered": {
    "query": {
      "dis_max": {
        "queries": [
          {"match": {"text": term}},
          {"match": {"title": term}}
        ]
      }
    },
    "filter": {
      "bool": {
        "must_not": must_not
      }
    } 
  }
}

I'm using the elasticsearch api for node.js to construct my queries and get results from elasticsearch.

I'm assuming I'm having this problem because of stemming and that "codes" is stored as "code" in the search index.

Is there a way to solve this without using an external algorithm to stem my queries as well? Or is there an elegant way to solve this issue?

Any help is much appreciated!

Update

This is my analyzer:

{
 "settings": {
  "analysis": {
    "analyzer": {
      "stopword_analyzer": { 
        "type": "snowball", 
        "stopwords": ["a", "able", "about", "across", "after", "all",      "almost", "also", "am", "among", "an", "and", "any", "are", "as", "at", "be", "because", "been", "but", "by", "can", "cannot", "could", "dear", "did", "do", "does", "either", "else", "ever","every", "for", "from", "get", "got", "had", "has", "have", "he", "her", "hers", "him", "his", "how", "however", "i", "if", "in", "into", "is", "it", "its", "just", "least", "let", "like",  "may", "me", "might", "most", "must", "my", "neither", "no", "nor", "not", "of", "off", "often", "on", "only", "or", "other", "our", "own", "rather", "said", "say", "says", "she", "should", "since", "so", "some", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "tis", "to", "too", "us", "wants", "was", "we", "were", "what", "when", "where", "which", "while", "who", "whom", "why", "will", "with", "would", "yet", "you", "your"]
      }
    }
  }
 }
}

The text field has the following mapping:

"text": {
    "type": "string",
    "analyzer": "stopword_analyzer"
  }
What's the mapping of the text field? - Andrei Stefan
Updated the question with it, thanks! - Vishal Rao

1 Answer

When I include "code" in the "must_not" field, it's fine and I still get my results as expected

It's not about must_not; it's about the term filter you use inside must_not. The term filter takes your search text ("code", "codes", or whatever) and uses that exact value for filtering, without running it through the analyzer.

But the analyzer you are using changes the terms being indexed. For example, if you index "coding", the term actually stored in the inverted index is "code". Remember that term looks for exact values. So if you search for "codes", nothing is found, because the only term stored for your document is "code".
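To make the difference concrete, here is a toy sketch in plain JavaScript. It does not talk to Elasticsearch at all: `toyStems`, `toyAnalyze`, `termLookup` and `matchLookup` are made-up names, and the lookup table is only a stand-in for what the snowball analyzer does to these three words.

```javascript
// Hypothetical stand-in for the analyzer: a fixed lookup table,
// NOT the real Snowball stemmer.
var toyStems = { code: "code", codes: "code", coding: "code" };

function toyAnalyze(word) {
  var lower = word.toLowerCase();
  return toyStems.hasOwnProperty(lower) ? toyStems[lower] : lower;
}

// Index time: only the analyzed form of each word is stored.
var indexedTerms = ["codes", "Coding"].map(toyAnalyze);

// term-style lookup: compares the raw input with the stored terms.
function termLookup(input) {
  return indexedTerms.indexOf(input) !== -1;
}

// match-style lookup: analyzes the input first, then compares.
function matchLookup(input) {
  return indexedTerms.indexOf(toyAnalyze(input)) !== -1;
}
```

In this toy model, `termLookup("codes")` finds nothing because only "code" was stored, while `matchLookup("codes")` analyzes the input down to "code" and does find it.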

I suggest trying out match instead of term in the must_not part as that will use the analyzer at search time as well. Something like this:

  "filter": {
    "bool": {
      "must_not": [
        {
          "query": {
            "match": {
              "text": "codes"
            }
          }
        },
        {
          "query": {
            "match": {
              "text": "coding"
            }
          }
        }
      ]
    }
  }
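
Applied to the loop from the question, that could look roughly like the sketch below. `buildQuery` is a hypothetical helper name, not part of the Elasticsearch client, and the query shape is simply the one from the question with the term filters swapped for match queries wrapped in a query filter. The explicit `toLowerCase()` is dropped on the assumption that the analyzer lowercases the text at search time.

```javascript
// Hypothetical helper: rebuilds the filtered query from the question,
// using match (analyzed) instead of term (not analyzed) in must_not.
function buildQuery(term, excludeWords) {
  var mustNot = excludeWords.map(function (word) {
    // Each match query is wrapped in a "query" filter so it can sit
    // inside the bool filter.
    return { query: { match: { text: word } } };
  });

  return {
    filtered: {
      query: {
        dis_max: {
          queries: [
            { match: { text: term } },
            { match: { title: term } }
          ]
        }
      },
      filter: {
        bool: { must_not: mustNot }
      }
    }
  };
}

// Example: search for "search" while excluding anything that stems like "codes".
var query = buildQuery("search", ["codes"]);
```

Because the excluded words now go through the same analyzer as the indexed text, excluding "codes" should also exclude documents containing "code" or "coding".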