For Common Terms Query
Low Frequency Words
- More important.
- You can construct the query to return documents in which
  - all low frequency words of the query string must be present (use "low_freq_operator": "and")
  - only some of them must be present (use "low_freq_operator": "or")
  - some percentage of them must be present (use minimum_should_match)
High Frequency Words
- Less important.
- You can construct the query so that the stop words in the query string influence the score when
  - all of them must be considered (use "high_freq_operator": "and")
  - only some of them must be considered (use "high_freq_operator": "or")
  - some percentage of them must be considered (use minimum_should_match)
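- They only influence the relevancy score.
- If no low frequency words exist, all terms in the query string are run as a single query. Note that the Elasticsearch documentation describes this single query as a conjunction (all terms required) by default, rather than a typical should clause.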
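To make the options above concrete, here is a minimal sketch of a request body that combines them, written as a Python dictionary and printed as JSON. The field name `body`, the query text, and the numeric values are assumptions chosen for illustration. The object form of `minimum_should_match` with `low_freq` and `high_freq` keys follows the common terms query documentation.

```python
import json

# Field name, query text and numbers are illustrative assumptions.
query_body = {
    "query": {
        "common": {
            "body": {
                "query": "stairway to heaven",
                "cutoff_frequency": 0.001,
                # Low frequency terms decide which documents match.
                "low_freq_operator": "or",
                # High frequency terms only contribute to the score.
                "high_freq_operator": "or",
                # Separate thresholds for the two groups of terms.
                "minimum_should_match": {
                    "low_freq": 2,
                    "high_freq": 3,
                },
            }
        }
    }
}

print(json.dumps(query_body, indent=2))
```

The printed JSON can be sent as the search request body to a cluster version that still supports the common terms query.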
How does it categorize words as less frequent or more frequent?
As per the LINK,
Terms are allocated to the high or low frequency groups based on the
cutoff_frequency, which can be specified as an absolute frequency
(>=1) or as a relative frequency (0.0 .. 1.0)....
Perhaps the most interesting property of this query is that it adapts
to domain specific stopwords automatically. For example, on a video
hosting site, common terms like "clip" or "video" will automatically
behave as stopwords without the need to maintain a manual list.
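The grouping described in the quote can be sketched in a few lines of Python. This is only an illustration and not the actual Lucene or Elasticsearch code: the function name `split_terms`, the document counts, and the exact boundary handling (a term counts as high frequency when its document frequency exceeds the cutoff) are all assumptions.

```python
def split_terms(doc_freqs, num_docs, cutoff_frequency):
    """Split query terms into (low, high) frequency groups.

    doc_freqs maps each term to the number of documents containing it.
    Assumption: a term is "high frequency" when its document frequency
    is strictly above the cutoff; the real boundary handling may differ.
    """
    if cutoff_frequency >= 1:
        # Absolute cutoff: compare document counts directly.
        threshold = cutoff_frequency
    else:
        # Relative cutoff: a fraction of all documents in the index.
        threshold = cutoff_frequency * num_docs
    low = [t for t, df in doc_freqs.items() if df <= threshold]
    high = [t for t, df in doc_freqs.items() if df > threshold]
    return low, high


# Made-up document frequencies for an index of 100,000 documents.
doc_freqs = {"stairway": 40, "to": 90_000, "heaven": 75}

print(split_terms(doc_freqs, num_docs=100_000, cutoff_frequency=0.001))
print(split_terms(doc_freqs, num_docs=100_000, cutoff_frequency=50))
```

With the relative cutoff only "to" lands in the high frequency group. With the absolute cutoff of 50, "heaven" joins it, which shows how the same query can be split differently depending on the index and the cutoff.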
How it works, with an example
From this LINK,
The common terms query is a modern alternative to stopwords which improves the precision and recall of search results
(by taking stopwords into account), without sacrificing performance.
Let's say I have the documents below:
Document 1: Is there stairway to this path?
Document 2: Is there a stairway to heaven?
Document 3: Stairway to heaven
.....
.....
Now say your search query is as below:
{
  "query": {
    "common": {
      "body": {
        "query": "stairway to heaven",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "and"
      }
    }
  }
}
When you use "low_freq_operator": "and", the result would be Document 3 followed by Document 2 only. When you use "low_freq_operator": "or", the result would be Document 3, Document 2, and Document 1, in that order.
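The matching part of this behaviour can be reproduced with a small Python simulation. It is a sketch under stated assumptions: "stairway" and "heaven" are treated as the low frequency terms, the tokenizer is a simple lowercase word split, and scoring (and therefore the order of the hits) is not modeled at all. The helper names `tokens` and `matching_docs` are hypothetical.

```python
import re

docs = {
    1: "Is there stairway to this path?",
    2: "Is there a stairway to heaven?",
    3: "Stairway to heaven",
}


def tokens(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matching_docs(low_freq_terms, operator):
    """Return ids of documents matched by the low frequency terms.

    "and" requires every term to be present, "or" requires at least one.
    """
    combine = all if operator == "and" else any
    return [
        doc_id
        for doc_id, text in docs.items()
        if combine(term in tokens(text) for term in low_freq_terms)
    ]


# Assumption: "to" is high frequency, so it plays no part in matching.
low_freq_terms = ["stairway", "heaven"]

print(matching_docs(low_freq_terms, "and"))  # [2, 3]
print(matching_docs(low_freq_terms, "or"))   # [1, 2, 3]
```

The lists are in document id order, not relevance order. In a real index the score would decide the ranking.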
So when you use "or", the high frequency word, i.e. "to", is only used to influence the score. In a similar way, "high_freq_operator" applies to the stop words, but again it is only used to influence the score.
I hope the explanation above answers your first question. As for the question below:
Does it mean that if query has 4 low-frequency words, document with 2
of those will be returned as a hit, but document with only 1 of query
words will not be returned right?
Yes, that's correct.
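As a quick check of that reasoning, here is a minimal Python sketch. It assumes minimum_should_match resolves to 2 for the low frequency terms, and the four terms and the helper name `is_hit` are placeholders invented for the illustration.

```python
def is_hit(doc_terms, low_freq_terms, minimum_should_match):
    """Return True when enough low frequency terms occur in the document."""
    matched = sum(1 for term in low_freq_terms if term in doc_terms)
    return matched >= minimum_should_match


# Four hypothetical low frequency query terms.
low_freq_terms = ["alpha", "beta", "gamma", "delta"]

print(is_hit({"alpha", "beta"}, low_freq_terms, 2))  # True: 2 of 4 present
print(is_hit({"alpha"}, low_freq_terms, 2))          # False: only 1 present
```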
Hope it helps!