
I'm a bit new to ES and I'm not sure how to do the following:

I run a search with a query which will only contain a few 'should' parameters.

Then also a few aggregations, e.g. a percentile, term bucketing, etc.

But I only want the aggregations to run over, e.g., the first 1000 documents (which I hope are scored and ordered by score).

The idea is that I want the aggs for specific terms, but if not enough documents are found, then fill it up - limited to a specific maximum number of documents to aggregate over. From the docs it seems that size is the number of documents returned as hits, not the number that will be used for the aggs (I do not need hits, only aggs returned).

So how do I go about this? Is there a nested/subsequent query? Must I pipeline something, e.g. first search for 1k docs, then aggregate over those?

It would be ideal if the documents could first be sorted by the timestamp at which they were indexed - so that the documents used to 'fill up' are the latest - but AFAIK that is not possible?

Fill it up?

'Fill it up' means: I have 100 docs for one specified 'should' field. Then I still need another 900 docs to reach the required 1k result size to aggregate over (so, to fill it up to the number required). So instead of using a filter, I saw the 'combined queries' in the docs and I think using a 'should' parameter would suffice.

Can you clarify what you mean by "fill it up"? Perhaps by including a JSON example of what you expect? – Phil
@Phil edited question – Tjorriemorrie
This might help - a filter with limit: stackoverflow.com/a/29127328/689625 – jay

1 Answer


Solution:

        from elasticsearch_dsl import A, Q

        sample = A('sampler', shard_size=docs_per_shard)

In order to aggregate over only a subset of documents, use the sampler aggregation. It restricts any sub-aggregations to a sample of the top-scoring documents. It requires a shard_size parameter, which is the number of docs each shard contributes to the sample; the value given here is the required sample size (100) divided by the number of active shards (5).
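For concreteness, a minimal sketch of how docs_per_shard could be computed; the 100-doc sample and 5 shards are simply the numbers from the explanation above:

        required_sample_size = 100  # total docs to aggregate over
        active_shards = 5           # number of shards being queried
        docs_per_shard = required_sample_size // active_shards  # = 20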

        terms = A('terms', field='action')
        sea = GameAction.search()  # GameAction is an elasticsearch-dsl document class mapped to the index
        sea.aggs.bucket('mesam', sample).bucket('aksies', terms)  # aggs are attached in place

With the subsample in place, the terms aggregation is nested ('piped') inside the sampler, so it only counts the sampled documents.
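To check what will be sent to Elasticsearch, the request body can be inspected. A rough sketch of the output, assuming docs_per_shard worked out to 20 as above:

        sea.to_dict()
        # {'aggs': {'mesam': {'sampler': {'shard_size': 20},
        #                     'aggs': {'aksies': {'terms': {'field': 'action'}}}}}}

This gives the basic solution, but let's make it even better.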

        sea = sea.sort('_score', {'created_at': 'desc'})

This sorts the docs by score and then by created date: the most relevant docs come first, AND among equally scored docs the most recent come first.
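As an aside, elasticsearch-dsl also accepts the '-field' shorthand for a descending sort, so the line above can equivalently be written as:

        sea = sea.sort('_score', '-created_at')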

Furthermore:

        # Each 'should' clause boosts matching docs without excluding the
        # rest, so the sample fills up with the closest matches first.
        # p, vs and phase come from the surrounding application code.
        sea = sea.query('bool', boost=10, should=[Q('match', player=p['name'])])
        sea = sea.query('bool', boost=5, should=[Q('match', vs=vs)])
        sea = sea.query('bool', boost=2, should=[Q('match', phase=phase)])
        sea = sea.query('bool', boost=1, should=[Q('match', site='handhq')])
        # Exclude docs whose action is 'gg', 'sb' or 'bb'; the negated
        # matches combine into a single bool query with must_not clauses.
        sea = sea.query('bool', must=[
            ~Q('match', action='gg') &
            ~Q('match', action='sb') &
            ~Q('match', action='bb')])

Relevant here is the should: it allows the sample to be 'filled up' with the most relevant docs - those that match or closely match - while the sort above puts the latest first among equals. These fields are mostly not_analyzed. The clauses can also be boosted, which gives a very good solution to the problem.
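Finally, since the question only needs the aggs and not the hits themselves, the hit size can be set to zero and the buckets read off the response. A minimal sketch using the aggregation names from above:

        sea = sea.extra(size=0)  # return only aggs, no hits
        response = sea.execute()
        for bucket in response.aggregations.mesam.aksies.buckets:
            print(bucket.key, bucket.doc_count)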