Trying to use the Common Terms Query ("https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-common-terms-query.html") but cannot get one particular thing to work: add the high-frequency words' scores to the total score ONLY if all low-frequency words from the query have been matched.

Tried using "low_freq_operator": "and", but it makes all low-frequency words from the query required - which I don't want.

Also - if I use

"minimum_should_match": {
    "low_freq": "50%"
}

Does it mean that if the query has 4 low-frequency words, a document with 2 of those will be returned as a hit, but a document with only 1 of the query words will not be returned?

Thanks.

1 Answer

For Common Terms Query

Low Frequency Words

  • More important
  • You can construct the query to return documents in which the low-frequency words of the query string
    • must all be present (make use of "low_freq_operator": "and")
    • need only partly be present, i.e. at least one of them (make use of "low_freq_operator": "or")
    • must be present in some percentage (make use of minimum_should_match)

High Frequency Words

  • Less Important.
  • You can construct the query so that the score is influenced only when the stop words of the query string
    • are all present (make use of "high_freq_operator": "and")
    • are at least partly present (make use of "high_freq_operator": "or")
    • are present in some percentage (make use of minimum_should_match)
  • They only influence the relevancy score.
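  • If the query contains no low-frequency words, the documentation says the high-frequency words are run as a single query in which, by default, all of them are required (AND). It can instead be run as an OR with a suitable minimum_should_match.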
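
To make the options above concrete, here is a minimal sketch of a request body that sets a separate minimum_should_match for each group. It is written as a Python dict only so that it can be printed as JSON. The field name "body", the query string and the specific values are assumptions for illustration, not something taken from the question.

```python
import json

# Hypothetical request body: the field name "body" and all values are illustrative.
query = {
    "query": {
        "common": {
            "body": {
                "query": "is there a stairway to heaven",
                "cutoff_frequency": 0.001,
                # One setting per group: low-frequency terms decide which
                # documents match; high-frequency terms only add to the score
                # of those matches, and only if enough of them are present.
                "minimum_should_match": {
                    "low_freq": "50%",
                    "high_freq": 2,
                },
            }
        }
    }
}

print(json.dumps(query, indent=2))
```

The printed JSON can be sent as the body of a search request. Whether these thresholds suit a given index depends on its data.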

How does it categorize words as less frequent or more frequent

As per the LINK,

Terms are allocated to the high or low frequency groups based on the cutoff_frequency, which can be specified as an absolute frequency (>=1) or as a relative frequency (0.0 .. 1.0)....

Perhaps the most interesting property of this query is that it adapts to domain specific stopwords automatically. For example, on a video hosting site, common terms like "clip" or "video" will automatically behave as stopwords without the need to maintain a manual list.
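
As a rough illustration of the grouping described in the quote, the sketch below splits query terms using a cutoff_frequency-style threshold. It is a simplified approximation under stated assumptions, not Elasticsearch's actual code: the function name, the index size and the document frequencies are all made up.

```python
import math


def split_by_frequency(doc_freqs, cutoff_frequency, total_docs):
    """Split terms into (low, high) frequency groups.

    doc_freqs maps each query term to the number of documents containing it.
    A cutoff >= 1 is treated as an absolute document count; a cutoff below 1
    is treated as a fraction of total_docs.
    """
    if cutoff_frequency >= 1:
        threshold = cutoff_frequency
    else:
        threshold = math.ceil(cutoff_frequency * total_docs)
    low = [term for term, df in doc_freqs.items() if df <= threshold]
    high = [term for term, df in doc_freqs.items() if df > threshold]
    return low, high


# Hypothetical index of 1,000,000 documents with made-up document frequencies.
doc_freqs = {"stairway": 120, "to": 850_000, "heaven": 900}
low, high = split_by_frequency(doc_freqs, 0.001, 1_000_000)
print("low:", low)
print("high:", high)
```

With these made-up numbers, "to" ends up in the high-frequency group without appearing in any stopword list, which is the adaptive behaviour the quote describes.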

How it works with example

From this LINK,

The common terms query is a modern alternative to stopwords which improves the precision and recall of search results (by taking stopwords into account), without sacrificing performance.

Let's say I have the documents below:

Document 1: Is there stairway to this path?
Document 2: Is there a stairway to heaven?
Document 3: Stairway to heaven
..... 
.....

Now say your search query is as below:

{
    "query": {
        "common": {
            "body": {
                "query": "stairway to heaven",
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and"
            }
        }
    }
}

When you use "and", the result would be Document 3 followed by Document 2 only. When you use "or", the result would be Document 3, Document 2 and Document 1, in that order.

So when you use "or", the high-frequency word, i.e. "to", is used here only to influence the score. In a similar way, high_freq_operator applies to the stop words, but again it is only used to influence the score.
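
The matching part of this example can be sketched in a few lines of Python. This is only a toy model: it assumes "to" is the one high-frequency term (as it would be in a realistically sized index), uses a naive tokenizer, and does not compute scores, so it shows which documents match but not their order.

```python
import re

DOCS = {
    1: "Is there stairway to this path?",
    2: "Is there a stairway to heaven?",
    3: "Stairway to heaven",
}

# Assumption: in a realistically sized index "to" falls above the cutoff.
HIGH_FREQ = {"to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def matching_docs(query, low_freq_operator):
    """Return ids of documents whose low-frequency terms satisfy the operator."""
    low_terms = [t for t in tokenize(query) if t not in HIGH_FREQ]
    combine = all if low_freq_operator == "and" else any
    return [
        doc_id
        for doc_id, text in DOCS.items()
        if combine(term in set(tokenize(text)) for term in low_terms)
    ]


print(matching_docs("stairway to heaven", "and"))
print(matching_docs("stairway to heaven", "or"))
```

Document 1 lacks "heaven", so it drops out under "and" and matches under "or". The word "to" never decides whether a document matches in this model.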

Hopefully the above explanation covers your first question. As for the second one,

Does it mean that if the query has 4 low-frequency words, a document with 2 of those will be returned as a hit, but a document with only 1 of the query words will not be returned?

Yes, that's correct.
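
The arithmetic behind that answer can be sketched as follows, assuming (as I understand the minimum_should_match documentation) that a positive percentage is rounded down to a whole number of terms. The function names are hypothetical helpers, not part of any Elasticsearch client.

```python
def required_low_freq_matches(num_low_freq_terms, percent):
    """Number of low-frequency terms a document must contain.

    Assumes a positive percentage that is rounded down to a whole number.
    """
    return num_low_freq_terms * percent // 100


def is_hit(matched_terms, num_low_freq_terms, percent):
    """True if a document matching `matched_terms` low-frequency terms qualifies."""
    return matched_terms >= required_low_freq_matches(num_low_freq_terms, percent)


print(required_low_freq_matches(4, 50))
print(is_hit(2, 4, 50))
print(is_hit(1, 4, 50))
```

With 4 low-frequency words and "50%", two matching words are required, so a document with two of them is a hit and a document with one is not.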

Hope it helps!