I'm creating a simple search engine using Elasticsearch 7.7 and the Python elasticsearch_dsl package, version 7.0.0. I'm using the simple_query_string search because I'd like to enable the most common search functionality (boolean operators, phrase search) without having to parse the query myself. This is largely working well, except for the phrase match functionality.

I would like to ensure that every result includes the phrase whenever the query contains one, the way Google works. For example, if I search for "green eggs" and ham, no result should lack the phrase "green eggs".

Let's assume I have 3 documents in my index:

{
   "question": "I love my phrase",
   "background": "dont you"
},
{
   "question": "I love my phrase",
   "background": "and other terms"
},
{
   "question": "I have other terms",
   "background": "and more"
}

What I am seeing now:

As expected, the below query only returns the first two documents, which have "my phrase" in one of the fields.

    {
      'simple_query_string':
        {
          'query': '"my phrase"',
          'fields': ['question', 'background']
        }
     }

Contrary to what I expect, the below query will return all 3 results, with the 3rd one scored higher than the 1st.

    {
      'simple_query_string':
        {
          'query': '"my phrase" other terms',
          'fields': ['question', 'background']
        }
     }

How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?

Things I have tried that have not worked:

  • 'query': '"my phrase" AND (other terms)'
  • 'query': '"my phrase" AND other terms'

Thank you

1 Answer

> Contrary to what I expect, the below query will return all 3 results

By default, the words in the query are combined with the OR operator; see the description of the default_operator parameter in the simple_query_string documentation. Your second query is therefore interpreted as "my phrase" OR other OR terms, so it returns all 3 documents: each one contains at least one of "my phrase", other, or terms.
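
To make the OR behaviour concrete, here is a toy model in plain Python. It is only a sketch: it does not call Elasticsearch, it ignores analyzers and scoring, and the `clause_matches` helper is a hypothetical name introduced for illustration. It just checks which of the three sample documents match at least one clause.

```python
import re

# The three sample documents from the question.
DOCS = [
    {"question": "I love my phrase", "background": "dont you"},
    {"question": "I love my phrase", "background": "and other terms"},
    {"question": "I have other terms", "background": "and more"},
]

# The query '"my phrase" other terms' split into its three clauses.
CLAUSES = ['"my phrase"', "other", "terms"]


def clause_matches(doc, clause):
    """Toy matcher: a quoted clause must occur as consecutive words,
    a bare clause must occur as a single word, in any field."""
    for text in doc.values():
        words = re.findall(r"\w+", text.lower())
        if clause.startswith('"'):
            phrase = clause.strip('"').lower().split()
            n = len(phrase)
            if any(words[i:i + n] == phrase for i in range(len(words) - n + 1)):
                return True
        elif clause.lower() in words:
            return True
    return False


# OR semantics: a single matching clause is enough for a hit.
or_hits = [
    number
    for number, doc in enumerate(DOCS, start=1)
    if any(clause_matches(doc, clause) for clause in CLAUSES)
]
print(or_hits)  # [1, 2, 3]
```

The third document is a hit because it contains other and terms, even though the phrase is missing.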

> How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?

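AFAIK, this isn't possible with simple_query_string. You can use query_string instead, which supports the boolean operators + (the term must be present) and - (the term must not be present). Prefixing the phrase with + makes it mandatory, while other and terms stay optional and only raise the score of documents that contain them: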

{
  "query": {
    "query_string": {
      "query": "+\"my phrase\" other terms",
      "fields": ["question", "background"]
    }
  }
}
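
Since the question uses Python, here is a minimal sketch of building the same request body as a plain dict. The helper name `phrase_required_query` is hypothetical, and the code assumes the optional terms are passed through as-is, so any query_string syntax inside them stays active.

```python
import json


def phrase_required_query(phrase, optional_terms, fields):
    """Build a query_string body where `phrase` is mandatory (+ prefix)
    and `optional_terms` only influence scoring."""
    # Escape backslashes and double quotes so the phrase stays one quoted unit.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    query = f'+"{escaped}" {optional_terms}'.strip()
    return {
        "query": {
            "query_string": {
                "query": query,
                "fields": list(fields),
            }
        }
    }


body = phrase_required_query("my phrase", "other terms", ["question", "background"])
print(json.dumps(body, indent=2))
```

The resulting dict should be usable as the search body with the 7.x Python client, or the inner part can be passed to elasticsearch_dsl as `Search().query("query_string", query=..., fields=[...])`; treat both as untested suggestions for your setup.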