Using Elasticsearch v1.7.2 and a fairly large index, I'm getting different document counts for the following two searches, both of which use a fuzzy search in a query_string.
Query:
{
  "query": {
    "query_string": {
      "query": "rapt~4"
    }
  }
}
Filter:
{
  "filter": {
    "query": {
      "query_string": {
        "query": "rapt~4"
      }
    }
  }
}
The filter gives about 5% more results than the query. Why would the document counts be different? Are there options that I can specify to make them consistent?
Note that this inconsistency only occurs when I use a moderately sized dataset. I have tried inserting just a few (<10) documents that match the filter but not the query into a clean cluster, and in that case both the query and the filter match all of the documents. However, in a cluster with a single index, a single type, and a couple hundred documents, I start to see the discrepancy.
Using the explain=true option, it appears that the query score is computed using the Practical Scoring Function. The explanation gives information about the boost, queryNorm, idf, and term weights. In contrast, the filter explanation only reports the boost and queryNorm components of the Practical Scoring Function, not the idf or term weights.
Examples of responses with explanations are below. Note that I've removed many fields from my example hit and simplified the content, so term frequencies in the explanation will not match the actual content, other than the matched word (in this case "fact"). These responses are for the same event. My issue is that additional hits are included in the filter response that aren't included in the query response. Their explanations look identical.
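One way to pin down exactly which hits differ is to compare the `_id` values of the two responses on the client side. The following is a minimal sketch: `hit_ids` and `ids_only_in_filter` are hypothetical helpers (not part of any Elasticsearch client), the sample responses are made up, and it assumes both searches were run with a `size` large enough to return every hit.

```python
def hit_ids(response):
    """Collect the _id of every hit in a parsed search response body."""
    return {hit["_id"] for hit in response["hits"]["hits"]}


def ids_only_in_filter(query_response, filter_response):
    """Return the sorted IDs that the filter matched but the query did not."""
    return sorted(hit_ids(filter_response) - hit_ids(query_response))


# Made-up responses, trimmed to the fields the helpers actually read.
query_response = {"hits": {"total": 2, "hits": [{"_id": "1"}, {"_id": "2"}]}}
filter_response = {
    "hits": {"total": 3, "hits": [{"_id": "1"}, {"_id": "2"}, {"_id": "3"}]}
}

print(ids_only_in_filter(query_response, filter_response))
```

Running the same comparison against the real responses would give the list of documents to inspect with explain=true.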
Query:
curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"query_string":{"query":"rapt~"}},"explain":true}'
And query response:
{
  "_source": {
    "type": "example",
    "content": "to the fact that"
  },
  "_explanation": {
    "value": 0.10740301,
    "description": "sum of:",
    "details": [
      {
        "value": 0.10740301,
        "description": "weight(_all:fact^0.5 in 465) [PerFieldSimilarity], result of:",
        "details": [
          {
            "value": 0.10740301,
            "description": "score(doc=465,freq=2.0), product of:",
            "details": [
              {
                "value": 0.11091774,
                "description": "queryWeight, product of:",
                "details": [
                  {
                    "value": 0.5,
                    "description": "boost"
                  },
                  {
                    "value": 7.303468,
                    "description": "idf(docFreq=68, maxDocs=37706)"
                  },
                  {
                    "value": 0.03037399,
                    "description": "queryNorm"
                  }
                ]
              },
              {
                "value": 0.96831226,
                "description": "fieldWeight in 465, product of:",
                "details": [
                  {
                    "value": 1.4142135,
                    "description": "tf(freq=2.0), with freq of:",
                    "details": [
                      {
                        "value": 2,
                        "description": "termFreq=2.0"
                      }
                    ]
                  },
                  {
                    "value": 7.303468,
                    "description": "idf(docFreq=68, maxDocs=37706)"
                  },
                  {
                    "value": 0.09375,
                    "description": "fieldNorm(doc=465)"
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
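As a sanity check, the numbers in the explanation above can be recomputed by hand. This is a sketch that assumes the index uses Lucene's classic TF-IDF similarity (the default in Elasticsearch 1.x), where tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The boost, queryNorm, and fieldNorm values are copied from the explanation rather than derived.

```python
import math

# Values copied from the explanation above.
boost = 0.5
query_norm = 0.03037399
field_norm = 0.09375
term_freq = 2.0
doc_freq = 68
max_docs = 37706

# Classic Lucene TF-IDF components (assumed formulas, see the lead-in).
idf = 1.0 + math.log(max_docs / (doc_freq + 1))
tf = math.sqrt(term_freq)

# queryWeight and fieldWeight are each a product of three factors.
query_weight = boost * idf * query_norm
field_weight = tf * idf * field_norm
score = query_weight * field_weight

print("idf         = %.6f" % idf)
print("queryWeight = %.8f" % query_weight)
print("fieldWeight = %.8f" % field_weight)
print("score       = %.8f" % score)
```

Every factor in the product is strictly positive here, so the score of a matching document cannot collapse to zero through this arithmetic alone.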
Filter:
curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"filtered":{"filter":{"fquery":{"query":{"query_string":{"query":"rapt~"}}}}}},"explain":true}'
And filter response:
{
  "_source": {
    "type": "example",
    "content": "to the fact that"
  },
  "_explanation": {
    "value": 1,
    "description": "ConstantScore(cache(+_type:example-type +org.elasticsearch.index.search.nested.NonNestedDocsFilter@737a6633)), product of:",
    "details": [
      {
        "value": 1,
        "description": "boost"
      },
      {
        "value": 1,
        "description": "queryNorm"
      }
    ]
  }
}
When I wrap the filter in a constant score query, I get exactly the same set of results as the filter (again, more than the query), but the explanation looks a little cleaner:
Constant-score query wrapped filter:
curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"constant_score":{"filter":{"query":{"query_string":{"query":"rapt~"}}}}},"explain":true}'
And constant-score query wrapped filter response:
{
  "_source": {
    "type": "example",
    "content": "to the fact that"
  },
  "_explanation": {
    "value": 1,
    "description": "ConstantScore(QueryWrapperFilter(_all:rapt~2)), product of:",
    "details": [
      {
        "value": 1,
        "description": "boost"
      },
      {
        "value": 1,
        "description": "queryNorm"
      }
    ]
  }
}
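The rewritten query `_all:rapt~2` in the explanation above suggests that terms within two edits of "rapt" are accepted, which is consistent with the hit matching on the word "fact". The sketch below checks that distance with a plain Levenshtein implementation. It is an illustration only, not Lucene's own fuzzy-matching code, which may also count transpositions (that makes no difference for this pair of words).

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# "rapt" -> "fact" needs two substitutions: r -> f and p -> c.
print(edit_distance("rapt", "fact"))
```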
Because the filter returns more results than the query, my guess is that the Practical Scoring Function ends up scoring documents that match the query with a score of 0. However, for a document that "matches" the query, none of the components of the scoring function should be zero.
Edit: I have recreated this issue on a smallish cluster of 238 documents. (The content of the documents is generated from an n-gram language model trained on Wikipedia text.) I have posted both the cluster and the JSON events on Dropbox. To see the issue on this data, run the following query, which returns the event with id=138:
{
  "explain": true,
  "query": {
    "bool": {
      "must_not": [
        {
          "query_string": {
            "query": "rap~",
            "fields": [
              "body"
            ]
          }
        }
      ],
      "must": [
        {
          "constant_score": {
            "filter": {
              "query": {
                "query_string": {
                  "query": "rap~",
                  "fields": [
                    "body"
                  ]
                }
              }
            }
          }
        }
      ]
    }
  }
}
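The request above asks for documents that the filter-wrapped form matches but the plain query does not. To try the same check with other terms or fields, the body can be generated instead of edited by hand. This is a minimal sketch: `discrepancy_query` is a hypothetical helper name, and it only builds the JSON body without sending it anywhere.

```python
import json


def discrepancy_query(text, field):
    """Build a search body for docs the filter matches but the query does not."""
    fuzzy = {"query_string": {"query": text, "fields": [field]}}
    return {
        "explain": True,
        "query": {
            "bool": {
                # Must match when wrapped as a constant-score filter...
                "must": [{"constant_score": {"filter": {"query": fuzzy}}}],
                # ...but must not match as a plain scoring query.
                "must_not": [fuzzy],
            }
        },
    }


print(json.dumps(discrepancy_query("rap~", "body"), indent=2))
```

If the query and the filter were consistent, a body built this way should return no hits at all.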
[...] constant_score? Have you tried just using the query there instead? – femtoRgon
[...] cURL command? Feel free to obfuscate the index/type names. – pickypg
[...] org.elasticsearch.index.search.nested.NonNestedDocsFilter, which implies something weird is happening in your filtered version. My filtered expression is "description": "ConstantScore(cache(QueryWrapperFilter(_all:titl~2))), product of:" without the other stuff, implying that your filter is being modified somehow before it gets into ES. – pickypg