Elasticsearch query and filter give different doc count when using a lucene fuzzy operator

Question

Using Elasticsearch v1.7.2 and a fairly large index, I'm getting different document counts for the following two searches, both of which use a fuzzy operator in a query_string.

Query:

{
  "query": {
     "query_string": {
        "query": "rapt~4"
     }
  }
}

Filter:

{
 "filter": {
    "query": {
       "query_string": {
          "query": "rapt~4"
       }
    }
 }
}

The filter gives about 5% more results than the query. Why would the document counts be different? Are there options that I can specify to make them consistent?

Note that this inconsistency only occurs with a moderately sized dataset. I tried inserting just a few (<10) documents that match the filter but not the query into a clean cluster, and there both the query and the filter matched all of the documents. However, in a cluster with a single index, a single type, and a couple hundred documents, I start to see the discrepancy.

Using the explain=true option, it appears that the query score is computed using the Practical Scoring Function. The explanation gives information about the boost, queryNorm, idf, and term weights. In contrast, the filter explanation only reports the boost and queryNorm components of the Practical Scoring Function, not the idf or term weights.

Examples of responses with explanations are below. Note that I've removed many fields from my example hit and simplified the content, so term frequencies in the explanation will not match the actual content, other than the matched word (in this case "fact"). These responses are for the same event. My issue is that additional hits are included in the filter response that aren't included in the query response. Their explanations look identical.

Query:

curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"query_string":{"query":"rapt~"}},"explain":true}'

And query response:

{
"_source": {
  "type": "example",
  "content": "to the fact that"
},
"_explanation": {
  "value": 0.10740301,
  "description": "sum of:",
  "details": [
    {
      "value": 0.10740301,
      "description": "weight(_all:fact^0.5 in 465) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 0.10740301,
          "description": "score(doc=465,freq=2.0), product of:",
          "details": [
            {
              "value": 0.11091774,
              "description": "queryWeight, product of:",
              "details": [
                {
                  "value": 0.5,
                  "description": "boost"
                },
                {
                  "value": 7.303468,
                  "description": "idf(docFreq=68, maxDocs=37706)"
                },
                {
                  "value": 0.03037399,
                  "description": "queryNorm"
                }
              ]
            },
            {
              "value": 0.96831226,
              "description": "fieldWeight in 465, product of:",
              "details": [
                {
                  "value": 1.4142135,
                  "description": "tf(freq=2.0), with freq of:",
                  "details": [
                    {
                      "value": 2,
                      "description": "termFreq=2.0"
                    }
                  ]
                },
                {
                  "value": 7.303468,
                  "description": "idf(docFreq=68, maxDocs=37706)"
                },
                {
                  "value": 0.09375,
                  "description": "fieldNorm(doc=465)"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
}
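As a sanity check on the explanation above, the numbers can be reproduced by hand. The sketch below assumes Lucene's classic TF-IDF similarity (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(termFreq)); the variable names are mine, and the inputs are copied from the "_explanation" output.

```python
import math

# Inputs copied from the "_explanation" in the query response.
boost = 0.5
query_norm = 0.03037399
doc_freq, max_docs = 68, 37706
term_freq = 2.0
field_norm = 0.09375

# Classic Lucene TF-IDF factors (assumption: the default similarity is in use).
idf = 1.0 + math.log(max_docs / (doc_freq + 1))
tf = math.sqrt(term_freq)

# queryWeight and fieldWeight are the two products listed in the explanation.
query_weight = boost * idf * query_norm
field_weight = tf * idf * field_norm
score = query_weight * field_weight

print("idf         ", idf)
print("queryWeight ", query_weight)
print("fieldWeight ", field_weight)
print("score       ", score)
```

The printed values should agree with the reported idf, queryWeight, fieldWeight and final score up to single-precision rounding, and every factor is strictly positive for a document that matches.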

Filter:

curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"filtered":{"filter":{"fquery":{"query":{"query_string":{"query":"rapt~"}}}}}},"explain":true}'

And filter response:

{
"_source": {
  "type": "example",
  "content": "to the fact that"
},
"_explanation": {
  "value": 1,
  "description": "ConstantScore(cache(+_type:example-type +org.elasticsearch.index.search.nested.NonNestedDocsFilter@737a6633)), product of:",
  "details": [
    {
      "value": 1,
      "description": "boost"
    },
    {
      "value": 1,
      "description": "queryNorm"
    }
  ]
}
}

When I wrap the filter in a constant score query, I get exactly the same set of results as the filter (again, more than the query), but the explanation looks a little cleaner:

Constant-score query wrapped filter:

curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"constant_score":{"filter":{"query":{"query_string":{"query":"rapt~"}}}}},"explain":true}'

And constant-score query wrapped filter response:

{
"_source": {
  "type": "example",
  "content": "to the fact that"
},
"_explanation": {
  "value": 1,
  "description": "ConstantScore(QueryWrapperFilter(_all:rapt~2)), product of:",
  "details": [
    {
      "value": 1,
      "description": "boost"
    },
    {
      "value": 1,
      "description": "queryNorm"
    }
  ]
}
}

Because the filter returns more results than the query, my guess is that the Practical Scoring Function ends up scoring documents that match the query with a score of 0. However, for a document that "matches" the query, none of the components of the scoring function should be zero.

Edit: I have recreated this issue on a smallish cluster of 238 documents. (Note that the content of the documents is generated from an ngram language model trained on Wikipedia text.) I have posted both the cluster and the JSON events on Dropbox. To see the issue on this data, run the following query, which returns the event with id=138:

{
 "explain": true,
 "query": {
    "bool": {
       "must_not": [
          {
             "query_string": {
                "query": "rap~",
                "fields": [
                   "body"
                ]
             }
          }
       ],
       "must": [
          {
             "constant_score": {
                "filter": {
                   "query": {
                      "query_string": {
                         "query": "rap~",
                         "fields": [
                            "body"
                         ]
                      }
                   }
                }
             }
          }
       ]
    }
 }
}
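Another way to isolate the extra documents is to run the two searches separately and diff the returned ids on the client. This is only a sketch: the helper names are hypothetical, the two response dicts are made-up placeholders shaped like a search response (hits.hits[]._id), and it assumes both searches were run with a size large enough to return every hit.

```python
def hit_ids(response):
    """Collect the _id of every hit in a search response dict."""
    return {hit["_id"] for hit in response["hits"]["hits"]}


def only_in_filter(query_response, filter_response):
    """Ids returned by the filter search but missing from the query search."""
    return hit_ids(filter_response) - hit_ids(query_response)


# Placeholder responses for illustration only (not real data).
query_response = {"hits": {"total": 2, "hits": [{"_id": "a"}, {"_id": "b"}]}}
filter_response = {
    "hits": {"total": 3, "hits": [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]}
}

extra = only_in_filter(query_response, filter_response)
print(sorted(extra))
```

The resulting ids can then be fetched individually with explain enabled to compare how each search treats them.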

Why did you make it a filter in the constant_score? Have you tried just using the query there instead? — femtoRgon
Good point. Using a constant_score query without a filter gives the same results as the simple query. The filter bit is what makes the difference. The following gives equivalent results to the constant_score query above: { "filter": { "query": { "query_string": { "query": "rapt~" } } } } I've updated my question. Thanks! — Ann Irvine
You are querying the _all field, which is analyzed. When using the query, your query string is passed through the analyzer. When using the filter, it may not be, or it may be analyzed in a slightly different way. Try adding explain:true or using one of the other debug features in Elasticsearch. — Jilles van Gurp
Can you also add the request that you are using, including the cURL command? Feel free to obfuscate the index/type names. — pickypg
It's weird because you have the org.elasticsearch.index.search.nested.NonNestedDocsFilter, which implies something weird is happening in your filtered version. My filtered expression is "description": "ConstantScore(cache(QueryWrapperFilter(_all:titl~2))), product of:" without the other stuff, implying that your filter is being modified somehow before it gets into ES. — pickypg

pickypg · Accepted Answer · 2016-01-06T22:13:38

In versions of Elasticsearch before Elasticsearch 5.x, filter at the top level indicated a post_filter. Post filters are generally only relevant when using aggregations.

Starting with Elasticsearch 5.0, you must explicitly write post_filter, which avoids this confusion.

As such, the difference is that your top query is literally limiting the results to a set of matching documents. The post filter effectively matches everything, then removes results from the hits only without impacting the count.
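To make the top-level behaviour concrete, here is a small sketch that rewrites a 1.x-style body so the post filter is spelled out. The helper name is hypothetical and it only touches the request dict; it does not talk to a cluster.

```python
import copy


def with_explicit_post_filter(body):
    """Return a copy of a search body with top-level "filter" renamed to "post_filter"."""
    out = copy.deepcopy(body)
    if "filter" in out and "post_filter" not in out:
        out["post_filter"] = out.pop("filter")
    return out


# The top-level filter form from the question.
legacy_body = {"filter": {"query": {"query_string": {"query": "rapt~"}}}}

explicit_body = with_explicit_post_filter(legacy_body)
print(explicit_body)
```

Written this way, it is easier to see that the body has no "query" section at all, so nothing restricts the search before the post filter is applied.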

...it appears that the query score is computed using...

Queries always compute scores and they are intended to help to properly sort items based on their relevance (score). Filters never compute scores; filters are intended for purely boolean logic that does not impact "relevance" beyond inclusion/exclusion.

To be fair, you can convert any query into a filter in multiple ways in Elasticsearch 1.x (in 2.x, all queries are also filters in the right context!), but I tend to use fquery. If you do this, then you should get the same results:

As a query:

{
  "query": {
     "query_string": {
        "query": "rapt~"
     }
  }
}

As a filter:

{
  "query": {
    "filtered": {
      "filter": {
        "fquery": {
          "query": {
            "query_string": {
              "query": "rapt~"
            }
          }
        }
      }
    }
  }
}

In ES 2.x, the filter simplifies too (and the query is unchanged):

{
  "query": {
    "bool": {
      "filter": {
        "query_string": {
          "query": "rapt~"
        }
      }
    }
  }
}
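If you build these bodies in code, the two filter forms can be generated from the same query. This is a minimal sketch with a hypothetical helper name; it only covers the 1.x and 2.x shapes shown in this answer.

```python
def as_filter_body(query, es_major):
    """Wrap a query dict so it runs in filter context for the given major version."""
    if es_major == 1:
        # 1.x: filtered query with the query wrapped in fquery.
        return {"query": {"filtered": {"filter": {"fquery": {"query": query}}}}}
    if es_major == 2:
        # 2.x: any query can be used directly in the bool filter clause.
        return {"query": {"bool": {"filter": query}}}
    raise ValueError("only major versions 1 and 2 are covered by this sketch")


fuzzy = {"query_string": {"query": "rapt~"}}

body_1x = as_filter_body(fuzzy, 1)
body_2x = as_filter_body(fuzzy, 2)
print(body_1x)
print(body_2x)
```

Either body can be sent as the request payload to _search in place of the hand-written JSON.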
