3
votes

I have a basic aggregation on an index with about 40 million documents.

{
    aggs: {
        countries: {
            filter: {
                bool: {
                    must: my_filters,
                }
            },
            aggs: {
                filteredCountries: {
                    terms: {
                        field: 'countryId',
                        min_doc_count: 1,
                        size: 15,
                    }
                }
            }
        }
    }
}

The index:

{
    "settings": {
        "number_of_shards": 5, 
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter",
                        "unique"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {
                "type": "integer"
            },
            "name": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "standard"
            },
            "countryId": {
                "type": "short"
            }
        }
    }
}

The search response time is 100ms, but the aggregation response time is about 1.5s and keeps increasing as we add more documents (it was about 200ms with 5 million documents). There are about 20 distinct countryId values right now.

What I tried so far:

  1. Allocating more RAM (from 4GB to 32GB): same results.
  2. Changing the countryId field data type to keyword and adding the eager_global_ordinals option (see the mapping sketch below): it made things worse.
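
For reference, the mapping change described in item 2 would look roughly like this (a sketch, not necessarily the exact mapping that was tested):

"countryId": {
    "type": "keyword",
    "eager_global_ordinals": true
}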

The Elasticsearch version is 7.8.0. Elasticsearch has 8GB of RAM; the server has 64GB of RAM and 16 CPUs. The index has 5 shards on 1 node.

I use this aggregation to build the filters shown with search results, so I need it to respond as fast as possible. For large result sets I don't need precision, so an approximate count, or even one capped at a limit (e.g. "100 or more"), would be fine.

Any ideas on how to speed up this aggregation?


1 Answer

0
votes

Reason for the slowness:

  1. Bucket explosion. Switching the terms aggregation to the breadth_first collect mode should speed it up.

As per the docs on the breadth_first collect mode:

Even though the number of actors may be comparatively small and we want only 50 result buckets, there is a combinatorial explosion of buckets during calculation - a single actor can produce n² buckets, where n is the number of actors. To find the 10 most popular actors and their 5 top co-actors, breadth_first first determines the top actors and only then collects their sub-aggregations.
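
A minimal sketch of the question's aggregation with the collect mode set (my_filters stands in for the question's filter clauses, as in the original query):

{
    "aggs": {
        "countries": {
            "filter": {
                "bool": {
                    "must": my_filters
                }
            },
            "aggs": {
                "filteredCountries": {
                    "terms": {
                        "field": "countryId",
                        "size": 15,
                        "collect_mode": "breadth_first"
                    }
                }
            }
        }
    }
}

Keep in mind that collect_mode mainly changes how sub-aggregations under the terms aggregation are computed, so its effect here depends on whether you add sub-aggregations later.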

  1. Set an execution hint. Since you have very few unique values, I suggest setting execution_hint to map (sketched below).

  2. Another optimization: if some documents are rarely relevant any more (say, not updated in the last few weeks), you can use a field from your filter to restrict the aggregation to that smaller set of documents (sketched below).

  3. Another optimization: use the terms aggregation's include/exclude options to aggregate only the country values you actually need, if your use case allows filtering values (sketched below).
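
A sketch for suggestion 1. Elasticsearch ignores the execution hint when it is not applicable to the field type, so with countryId mapped as a numeric short this mostly matters if you switch the field to keyword:

"filteredCountries": {
    "terms": {
        "field": "countryId",
        "size": 15,
        "execution_hint": "map"
    }
}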
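
A sketch for suggestion 2, restricting the aggregation to recently updated documents. The updatedAt date field is hypothetical and not part of the question's mapping; my_filters again stands in for the question's filter clauses:

{
    "aggs": {
        "countries": {
            "filter": {
                "bool": {
                    "must": my_filters,
                    "filter": [
                        { "range": { "updatedAt": { "gte": "now-4w" } } }
                    ]
                }
            },
            "aggs": {
                "filteredCountries": {
                    "terms": {
                        "field": "countryId",
                        "size": 15
                    }
                }
            }
        }
    }
}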
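
A sketch for suggestion 3, using include with exact values; the listed country IDs are placeholders for whichever values your use case actually needs:

"filteredCountries": {
    "terms": {
        "field": "countryId",
        "size": 15,
        "include": [1, 4, 7, 15]
    }
}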