1
votes

I encountered a problem, as I need to sort my buckets using a keyword field for this I have tried two approaches.

  1. I have been trying to sort the result of my aggregation (buckets) from the top hit aggregation. My top_hits contains one element which is the username
"user_data": {
          "top_hits": {
            "_source": {
              "includes": ["username"]
            },
            "size": 1
          }
        },

To sort the buckets i'm trying with a bucket sort, the bucket sort is something like this

sorting": {
          "bucket_sort": {
          "sort": [
              {
                "user_data>username": {    ----> This is the error 
                "order": "desc"
              }
            }
            ],
            "from": 0,
            "size": 25
          }
        }

But I received a syntax error basically the bucket path is wrong.

  1. Another approach that I used to accomplish the sort was to add another aggregation over the username to obtain the max. Something like this
"to_sort" : {
          "max": {
            "field": "username"
          }
        }

And use the following bucket_sort

"sorting": {
          "bucket_sort": {
          "sort": [
              {
                "to_sort": {    
                "order": "desc"
              }
            }
            ],
            "from": 0,
            "size": 25
          }
        }

But basically I can't to use a keyword field to use the max aggregation. Is there a way to sort my buckets using the username, the username is a keyword field?

The parent of my aggregation is

"aggs": {
    "CountryId": {
      "terms": {
        "field": "countryId",
        "size": 10000
      }

The value of the username is different between each bucket

The result of the buckets is something like this

"buckets" : [
        {
          "key" : "11111",
          "doc_count" : 17,
          "user_data" : {
            "hits" : {
              "total" : 10,
              "max_score" : 11,
              "hits" : [
                {
                  "_index" : "index_name",
                  "_type" : "index_name",
                  "_id" : "101010",
                  "_score" : 0.0,
                  "_source" : {
                    "username" : "cccccc"
                  }
                }
              ]
            }
          }
        },
        {
          "key" : "33333",
          "doc_count" : 17,
          "user_data" : {
            "hits" : {
              "total" : 10,
              "max_score" : 11,
              "hits" : [
                {
                  "_index" : "index_name",
                  "_type" : "index_name",
                  "_id" : "101010",
                  "_score" : 0.0,
                  "_source" : {
                    "username" : "bbbbb"
                  }
                }
              ]
            }
          }
        },
{
          "key" : "22222",
          "doc_count" : 17,
          "user_data" : {
            "hits" : {
              "total" : 10,
              "max_score" : 11,
              "hits" : [
                {
                  "_index" : "index_name",
                  "_type" : "index_name",
                  "_id" : "101010",
                  "_score" : 0.0,
                  "_source" : {
                    "username" : "aaaaa"
                  }
                }
              ]
            }
          }
        }
]

And the following buckets result is I would like to have

"buckets" : [
        {
          "key" : "22222",
          "doc_count" : 17,
          "user_data" : {
            "hits" : {
              "total" : 10,
              "max_score" : 11,
              "hits" : [
                {
                  "_index" : "index_name",
                  "_type" : "index_name",
                  "_id" : "101010",
                  "_score" : 0.0,
                  "_source" : {
                    "username" : "aaaaa"
                  }
                }
              ]
            }
          }
        },
        {
          "key" : "33333",
          "doc_count" : 17,
          "user_data" : {
            "hits" : {
              "total" : 10,
              "max_score" : 11,
              "hits" : [
                {
                  "_index" : "index_name",
                  "_type" : "index_name",
                  "_id" : "101010",
                  "_score" : 0.0,
                  "_source" : {
                    "username" : "bbbbb"
                  }
                }
              ]
            }
          }
        },
{
          "key" : "11111",
          "doc_count" : 17,
          "user_data" : {
            "hits" : {
              "total" : 10,
              "max_score" : 11,
              "hits" : [
                {
                  "_index" : "index_name",
                  "_type" : "index_name",
                  "_id" : "101010",
                  "_score" : 0.0,
                  "_source" : {
                    "username" : "ccccc"
                  }
                }
              ]
            }
          }
        }
]

How you can see the buckets was order by username.

1
Can you show what is the parent aggregation and on which field it is computed? Also is there a unique username value per bucket?Val
I have added more details about my problem.Hatch

1 Answers

0
votes

I had a problem similar to this and didn't found any answer on the internet. So I tried to build my own, took me almost a week :/. It won't work always because of the limit on the ordered hashcode generation for strings, so you will have to play with your own charset and the length of the first chars on the string you deem enough to sort (6 for me), do some tests because you only want to use the positive interval of the long type or it will not work at all (due to my charset length I could go up to 13). I basically, build my metric for the bucket_sort using a scripted_metric based on finding the top_hits manually from here and adapted it to compute an ordered hashcode of my wanted keyword. Below is my query where I sort the user's last session top hits by sso.name keyword, it should be more or less easy for you to adapt it to your problem.

{
  "size": 0,
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "user_id"
          }
        }
      ]
    }
  },
  "aggregations": {
    "by_user": {
      "terms": {
        "field": "user_id",
        "size": 10000,
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "my_top_hits_sso_ordered_hash": {
          "scripted_metric": {
            "init_script": "state.timestamp_latest = 0L; state.last_sso_ordered_hash = 0L",
            "map_script": """ 
              def current_date = doc['login_timestamp'].getValue().toInstant().toEpochMilli();
              if (current_date > state.timestamp_latest) {
                state.timestamp_latest = current_date;
                state.last_sso_ordered_hash = 0L;
                if(doc['sso.name'].size()>0) {
                  String charset = "abcdefghijklmnopqrstuvwxyz";
                  String ssoName = doc['sso.name'].value;
                  int length = charset.length(); 
                  for(int i = 0; i<Math.min(ssoName.length(), 6); i++) {
                    state.last_sso_ordered_hash = state.last_sso_ordered_hash*length + charset.indexOf(String.valueOf(ssoName.charAt(i))) + 1;
                  }
                }
              }
            """,
            "combine_script":"return state",
            "reduce_script": """ 
              def last_sso_ordered_hash = '';
              def timestamp_latest = 0L;
              for (s in states) {
                if (s.timestamp_latest > (timestamp_latest)) {
                  timestamp_latest = s.timestamp_latest; last_sso_ordered_hash = s.last_sso_ordered_hash;
                }
              }
              return last_sso_ordered_hash;
            """
          }
        },
        "user_last_session": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "sort": [
              {
                "login_timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        },
        "pagination": {
          "bucket_sort": {
            "sort": [
              {
                "my_top_hits_sso_ordered_hash.value": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 100
          }
        }
      }
    }
  }
}