Retrieve document frequency for terms in query result with aggregations

Question

For some of my queries to ElasticSearch I want three pieces of information back:

Which terms T occurred in the result document set?
How often does each element of T occur in the result document set?
How often does each element of T occur in the entire index (i.e. what is its document frequency)?

The first two points are easily determined using the default terms facet or, nowadays, the terms aggregation. So my question is really about the third point. Before ElasticSearch 1.x, i.e. before the switch to the 'aggregation' paradigm, I could use a terms facet with the 'global' option set to true and a QueryFilter to get the document frequency ('global counts') of exactly those terms occurring in the document set specified by the QueryFilter.

At first I thought I could do the same thing using a global aggregation, but it seems I can't. The reason is, if I understand correctly, that the original facet mechanism was centered around terms, whereas aggregation buckets are defined by the set of documents belonging to each bucket. That is, specifying the global option of a terms facet with a QueryFilter first determined the terms hit by the filter and then computed the facet values. Since the facet was global, I would receive the index-wide document counts.

With aggregations, it's different. The global aggregation can only be used as a top-level aggregation; it ignores the current query results and computes its sub-aggregations, e.g. a terms aggregation, on all documents in the index. For me, that's too much, since I WANT to restrict the returned terms ('buckets') to the terms in the result document set. But if I use a filter sub-aggregation with a terms sub-aggregation inside it, I restrict the term buckets to the filter again, and so get normal facet counts rather than document frequencies. The reason is that the buckets are determined after the filter is applied, so they are "too small". But I don't want to restrict the bucket size; I want to restrict the buckets to the terms in the query result set.

How can I get the document frequency of those terms in a query result set using aggregations (since facets are deprecated and will be removed)?

Thanks for your time!

EDIT: Here comes an example of how I tried to achieve the desired behaviour. I will define two aggregations:

global_agg_with_filter_and_terms
global_agg_with_terms_and_filter

Both have a global aggregation at the top because that is the only valid position for it. In the first aggregation, I filter the documents down to the original query and then apply a terms sub-aggregation. In the second aggregation, I do mostly the same, except that the filter aggregation is a sub-aggregation of the terms aggregation. Hence the similar names; only the order of aggregation differs.

{
    "query": {
        "query_string": {
            "query": "text: my query string"
        }
    },
    "aggs": {
        "global_agg_with_filter_and_terms": {
            "global": {},
            "aggs": {
                "filter_agg": {
                    "filter": {
                        "query": {
                            "query_string": {
                                "query": "text: my query string"
                            }
                        }
                    },
                    "aggs": {
                        "terms_agg": {
                            "terms": {
                                "field": "facets"
                            }
                        }
                    }
                }
            }
        },
        "global_agg_with_terms_and_filter": {
            "global": {},
            "aggs": {
                "document_frequency": {
                    "terms": {
                        "field": "facets"
                    },
                    "aggs": {
                        "term_count": {
                            "filter": {
                                "query": {
                                    "query_string": {
                                        "query": "text: my query string"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Response:

{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 221,
        "max_score": 0.9839197,
        "hits": <omitted>
    },
    "aggregations": {
        "global_agg_with_filter_and_terms": {
            "doc_count": 1978,
            "filter_agg": {
                "doc_count": 221,
                "terms_agg": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "fid8",
                            "doc_count": 155
                        },
                        {
                            "key": "fid6",
                            "doc_count": 40
                        },
                        {
                            "key": "fid9",
                            "doc_count": 10
                        },
                        {
                            "key": "fid5",
                            "doc_count": 9
                        },
                        {
                            "key": "fid13",
                            "doc_count": 5
                        },
                        {
                            "key": "fid7",
                            "doc_count": 2
                        }
                    ]
                }
            }
        },
        "global_agg_with_terms_and_filter": {
            "doc_count": 1978,
            "document_frequency": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": "fid8",
                        "doc_count": 1050,
                        "term_count": {
                            "doc_count": 155
                        }
                    },
                    {
                        "key": "fid6",
                        "doc_count": 668,
                        "term_count": {
                            "doc_count": 40
                        }
                    },
                    {
                        "key": "fid9",
                        "doc_count": 67,
                        "term_count": {
                            "doc_count": 10
                        }
                    },
                    {
                        "key": "fid5",
                        "doc_count": 65,
                        "term_count": {
                            "doc_count": 9
                        }
                    },
                    {
                        "key": "fid7",
                        "doc_count": 63,
                        "term_count": {
                            "doc_count": 2
                        }
                    },
                    {
                        "key": "fid13",
                        "doc_count": 55,
                        "term_count": {
                            "doc_count": 5
                        }
                    },
                    {
                        "key": "fid10",
                        "doc_count": 11,
                        "term_count": {
                            "doc_count": 0
                        }
                    },
                    {
                        "key": "fid11",
                        "doc_count": 9,
                        "term_count": {
                            "doc_count": 0
                        }
                    },
                    {
                        "key": "fid12",
                        "doc_count": 5,
                        "term_count": {
                            "doc_count": 0
                        }
                    }
                ]
            }
        }
    }
}

First, please have a look at the first two term buckets returned by both aggregations, with keys fid8 and fid6. We can easily see that those terms appear in the result set 155 and 40 times, respectively. Now look at the second aggregation, global_agg_with_terms_and_filter. Its terms aggregation is within the scope of the global aggregation, so here we can actually see the document frequencies, 1050 and 668, respectively. So this part looks good.

The issue arises when you scan further down the list of term buckets, to the buckets with keys fid10 to fid12. While we receive their document frequency, we can also see that their term_count is 0. This is because those terms did not occur in the result of our query, which we also used for the filter sub-aggregation. So the problem is that the document frequency and the facet count with regard to the actual query result are returned for ALL terms (global scope!). But I need them for exactly the terms that occurred in the query result, i.e. for exactly those terms returned by the first aggregation, global_agg_with_filter_and_terms.

Perhaps there is a possibility to define some kind of filter that removes all buckets whose filter sub-aggregation term_count has a doc_count of zero?
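For illustration, the bucket filtering asked about here could at least be done on the client side. The following is only a minimal Python sketch, not a server-side solution: the bucket list is copied from the document_frequency aggregation in the response above, and the function name drop_empty_buckets is made up for this example. It also assumes the global terms aggregation returned every term, i.e. that its size setting is large enough.

```python
# Buckets copied from the "document_frequency" aggregation in the response above.
buckets = [
    {"key": "fid8", "doc_count": 1050, "term_count": {"doc_count": 155}},
    {"key": "fid6", "doc_count": 668, "term_count": {"doc_count": 40}},
    {"key": "fid9", "doc_count": 67, "term_count": {"doc_count": 10}},
    {"key": "fid5", "doc_count": 65, "term_count": {"doc_count": 9}},
    {"key": "fid7", "doc_count": 63, "term_count": {"doc_count": 2}},
    {"key": "fid13", "doc_count": 55, "term_count": {"doc_count": 5}},
    {"key": "fid10", "doc_count": 11, "term_count": {"doc_count": 0}},
    {"key": "fid11", "doc_count": 9, "term_count": {"doc_count": 0}},
    {"key": "fid12", "doc_count": 5, "term_count": {"doc_count": 0}},
]


def drop_empty_buckets(buckets):
    """Keep only terms that occur at least once in the query result.

    Hypothetical helper: "doc_count" of the bucket is the index-wide
    document frequency, "term_count.doc_count" is the count within
    the query result.
    """
    return [
        {
            "term": b["key"],
            "result_count": b["term_count"]["doc_count"],
            "document_frequency": b["doc_count"],
        }
        for b in buckets
        if b["term_count"]["doc_count"] > 0
    ]


for row in drop_empty_buckets(buckets):
    print(row["term"], row["result_count"], row["document_frequency"])
```

This leaves the six terms fid8, fid6, fid9, fid5, fid7 and fid13, but the server still computes and transfers a bucket for every term in the index, which is exactly the overhead I would like to avoid.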

Can you exemplify with a query and result set what you have tried so far, what doesn't work and what's the desired behavior? I know you have described the behavior, but I'd like to see an example. - Andrei Stefan
Thanks for the comment. You're right, an example always makes things easier to understand. I hope the edit of my post makes the situation more understandable. - khituras

Shadocko · Accepted Answer · 2015-06-17T15:30:09

Hello and sorry if the answer is late.

You should have a look at the Significant Terms aggregation. Like the terms aggregation, it returns one bucket for each term occurring in the result set, with the number of occurrences available through doc_count, but you also get the number of occurrences in a background set through bg_count. In other words, it only creates buckets for terms that appear in documents of your query result set.

The default background set comprises all documents in the index (or indices) being searched, but it can be narrowed down to any subset you want using background_filter.
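As a rough illustration of background_filter, here is a small Python sketch that only builds the request body as a dictionary. The aggregation name terms_with_df, the helper build_request and the language field used in the filter are assumptions made for this example; facets is the field from the question. The filter is written in the 1.x style, so adapt it to your version.

```python
import json


def build_request(query_string, field, background_filter=None):
    """Build a search body with a significant_terms aggregation.

    Hypothetical helper: if background_filter is given, bg_count is
    computed against the documents matching that filter instead of
    the default background set.
    """
    sig_terms = {"field": field, "size": 100}
    if background_filter is not None:
        sig_terms["background_filter"] = background_filter
    return {
        "query": {"query_string": {"query": query_string}},
        "aggs": {"terms_with_df": {"significant_terms": sig_terms}},
    }


# "language" is an assumed field name, used only to show the shape of the filter.
body = build_request(
    "text: my query string",
    "facets",
    background_filter={"term": {"language": "en"}},
)
print(json.dumps(body, indent=2))
```

Leaving background_filter out keeps the default background set, which is what you want for plain index-wide document frequencies.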

You can use a scripted bucket scoring function to rank the buckets the way you want by combining several metrics:

_subset_freq: number of documents in the result set that contain the term,
_superset_freq: number of documents in the background set that contain the term,
_subset_size: number of documents in the result set,
_superset_size: number of documents in the background set.

Request:

{
  "query": {
    "query_string": {
      "query": "text: my query string"
    }
  },
  "aggs": {
    "terms": {
      "significant_terms": {
        "field": "facets",
        "script_heuristic": {
          "script": "_subset_freq"
        },
        "size": 100
      }
    }
  }
}
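Reading the two counts back out of the response could look roughly like the Python sketch below. The sample_response is hand-written for illustration and is not real output: it only mimics the bucket shape of a significant_terms result (key, doc_count, bg_count) and borrows the counts for fid8 and fid6 from the question. The helper term_frequencies is likewise a made-up name.

```python
def term_frequencies(response, agg_name="terms"):
    """Map each term to (count in result set, count in background set).

    Hypothetical helper: agg_name must match the aggregation name
    used in the request ("terms" in the request above).
    """
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: (b["doc_count"], b["bg_count"]) for b in buckets}


# Hand-written sample, NOT actual Elasticsearch output.
sample_response = {
    "aggregations": {
        "terms": {
            "doc_count": 221,
            "buckets": [
                {"key": "fid8", "doc_count": 155, "bg_count": 1050},
                {"key": "fid6", "doc_count": 40, "bg_count": 668},
            ],
        }
    }
}

for term, (in_result, in_background) in term_frequencies(sample_response).items():
    print(term, in_result, in_background)
```

One thing to check against the documentation for your version: as far as I know, significant_terms applies a min_doc_count of 3 by default, so rare terms such as fid7 (2 hits in the question's result set) may be missing unless you lower it.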
