
Hope everyone is staying safe!

I am trying to explore the proper way to tackle the following use case in Elasticsearch.

Let's say that I have about 700,000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will share the same primary_id). In all other cases the primary_id is not repeated in any other doc.

So on average, out of every 10 docs, 8 will have a unique primary id and one primary id will be shared between 2 docs.

To ensure uniqueness I tried using the terms aggregation, and I got buckets in the response to my initial search request but not for the subsequent scroll requests. Upon googling, I found that aggregations are only returned for the initial scroll request, not for subsequent ones.

As a result, I tried finding alternative solutions, including the one in this link: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/

It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your results into). But I receive client-side timeouts even with a high timeout setting on the client.
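For reference, the partitioned approach from that link boils down to issuing one request per partition, where the terms aggregation's "include" clause selects the partition. A minimal sketch of building those request bodies (the index and field names here are assumptions; the "include"/"partition"/"num_partitions" keys are standard terms-aggregation syntax):

```python
# Build one terms-aggregation request body per partition.
# With ~700k unique ids and 20 partitions, each partition holds
# roughly 35k terms, so "size" needs headroom above that.
NUM_PARTITIONS = 20

def partition_query(partition):
    return {
        "size": 0,
        "aggs": {
            "unique_ids": {
                "terms": {
                    "field": "primary_id.keyword",  # assumed field mapping
                    "include": {
                        "partition": partition,
                        "num_partitions": NUM_PARTITIONS,
                    },
                    "size": 40000,
                }
            }
        },
    }

# One body per partition; each would be sent as a separate search request.
bodies = [partition_query(p) for p in range(NUM_PARTITIONS)]
```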

Ideally, I want to know the best way to handle data where the cardinality of the field that forms the buckets is almost equal to the number of docs. The SQL equivalent would be SELECT DISTINCT(primary_id) FROM .....

But in Elasticsearch, distinct values can only be obtained via bucketing (a terms aggregation).

I also use a top_hits sub-aggregation under the terms aggregation to fetch the _source fields.

Any help would be extremely appreciated!

Thanks!


1 Answer


There are three ways to paginate an aggregation:

  1. Composite aggregation
  2. Partition
  3. Bucket sort

You have already tried partitioning.

Composite aggregation: combines multiple sources into composite buckets and allows pagination and sorting over them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" records, then pass the returned after_key to fetch the next "n" records.

GET index22/_search
{
 "size": 0,
 "aggs": {
   "ValueCount": {
     "value_count": {
       "field": "id.keyword"
     }
   },
   "pagination": {
     "composite": {
       "size": 2,
       "sources": [
         {
           "TradeRef": {
             "terms": {
               "field": "id.keyword"
             }
           }
         }
       ]
     }
   }
 }
}
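To make the after_key mechanics concrete, here is a small in-memory simulation of the paging loop (no cluster needed). In real code, each page would be a search request via an Elasticsearch client, with the previous response's after_key passed in the composite "after" parameter; the toy data and page size here are assumptions:

```python
# Toy data: ids with some duplicates, standing in for primary_id values.
docs = [{"id": i % 7} for i in range(10)]

def composite_page(after_key=None, size=2):
    """Mimic one composite-agg page: return (buckets, after_key).

    The composite agg walks the distinct keys in sorted order; a page
    shorter than `size` means there are no more pages.
    """
    keys = sorted({d["id"] for d in docs})
    if after_key is not None:
        keys = [k for k in keys if k > after_key]
    page = keys[:size]
    buckets = [{"key": {"TradeRef": k}} for k in page]
    next_after = page[-1] if len(page) == size else None
    return buckets, next_after

# The client-side paging loop: feed each page's after_key into the next request.
all_keys, after = [], None
while True:
    buckets, after = composite_page(after)
    all_keys += [b["key"]["TradeRef"] for b in buckets]
    if after is None:
        break
```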

Bucket sort

The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned by the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.

So this isn't suitable for your case.

You can increase the result size to a value greater than 10,000 by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.

A better option is to use the scroll API and perform the distinct operation on the client side.
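A sketch of that client-side distinct: dedupe the scrolled hits by primary_id as the batches arrive. The scroll_batches() generator below is a stand-in for the real scroll loop (an initial search with a scroll timeout, then repeated scroll calls until no hits remain); the field names are assumptions:

```python
def scroll_batches():
    """Stand-in for the real scroll loop, yielding batches of hits."""
    yield [{"_source": {"primary_id": "A", "v": 1}},
           {"_source": {"primary_id": "B", "v": 2}}]
    yield [{"_source": {"primary_id": "A", "v": 3}},  # duplicate primary_id
           {"_source": {"primary_id": "C", "v": 4}}]

def distinct_by_primary_id(batches):
    """Keep the first doc seen for each primary_id across all batches."""
    seen = {}
    for hits in batches:
        for hit in hits:
            src = hit["_source"]
            seen.setdefault(src["primary_id"], src)
    return list(seen.values())

unique_docs = distinct_by_primary_id(scroll_batches())
```

Since at most 2-3 docs share a primary_id, the dedup dict holds roughly one entry per unique id, so for ~700k docs this stays well within typical client memory.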