Elasticsearch Terms or Cardinality Aggregation - Order by number of distinct values

Question

Friends,

I am doing some analysis to find unique pairs from 100s of millions of documents. The mock example is as shown below:

doc field1 field2

AAA : BBB
AAA : CCC
PPP : QQQ
PPP : QQQ
XXX : YYY
XXX : YYY
MMM : NNN

90% of the document contains an unique pair as shown above in doc 3, 4, 5, 6 and 7 which I am not interested on my aggregation result. I am interested to aggregate doc 1 and 2.

Terms Aggregation Query:



    "aggs": {
        "f1": {
            "terms": {
                "field": "FIELD1",
                "min_doc_count": 2
              },
              "aggs": {
                "f2": {
                      "terms": {
                    "field": "FIELD2"
                      }
                }
              }
        }
    }

Term Aggregation Result



    "aggregations": {
         "f1": {
               "buckets": [
                 {
                    "key": "PPP",
                    "doc_count": 2,
                    "f2": {
                        "buckets": [
                              {
                               "key": "QQQ",
                            "doc_count": 2
                              }
                         ]
                    }
                  },
                  {
                    "key": "XXX",
                    "doc_count": 2,
                    "f2": {
                        "buckets": [
                              {
                               "key": "YYY",
                            "doc_count": 2
                              }
                         ]
                    }
                  },
                  {
                    "key": "AAA",
                    "doc_count": 2,
                    "f2": {
                        "buckets": [
                              {
                                   "key": "BBB",
                                "doc_count": 1
                              },
                              {
                               "key": "CCC",
                               "doc_count": 1
                              }
                         ]
                    }
                  }
            ]
         }
    }

I am interested only on key AAA to be in the aggregation result. What is the best way to filter the aggregation result containing distinct pairs?

I tried with cardinality aggregation which result unque value count. However I am not able to filter out what I am not interested from the aggregation results.

Cardinality Aggregation Query



    "aggs": {
        "f1": {
            "terms": {
                "field": "FIELD1",
                "min_doc_count": 2
              },
            "aggs": {
                "f2": {
                      "cardinality": {
                        "field": "FIELD2"
                      }
                }
              }
        }
    }

Cardinality Aggregation Result



    "aggregations": {
        "f1": {
               "buckets": [
                 {
                    "key": "PPP",
                    "doc_count": 2,
                    "f2": {
                          "value" : 1
                    }
                  },
                  {
                    "key": "XXX",
                    "doc_count": 2,
                    "f2": {
                          "value" : 1
                    }
                  },
                  {
                    "key": "AAA",
                    "doc_count": 2,
                    "f2": {
                          "value" : 2
                    }
                  }
            ]
         }
    }

Atleast if I could sort by cardinal value, that would be help me to find some workarounds. Please help me in this regard.

P.S: Writing a spark/mapreduce program to post process/filter the aggregation result is not expected solution for this issue.

If you want to apply sorting on the cardinal value getting from the second aggregation, you can do by applying order in first aggregation like this "terms":{"field":"field1","min_doc_count":2,"order":{"f2":"desc/asc"}} — Vara Prasad

TonyLxc TonyLxc · Accepted Answer · 2014-10-26T06:54:21

I suggest to use filter query along with aggregations, since you are only interested in field1=AAA.

I have a similar example here.

For example, I have an index of all patients in my hospital. I store their drug use in a nested object DRUG. Each patient could take different drugs, and each could take a single drug for multiple times.

Now if I wanted to find the number of patients who took aspirin at least once, the query could be:

{
  "size": 0,
  "_source": false,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "nested": {
          "path": "DRUG",
          "filter": {
            "bool": {
              "must": [{ "term": { "DRUG.NAME": "aspirin" } }]
  }}}}}},
  "aggs": {
    "DRUG_FACETS": {
      "nested": {
        "path": "DRUG"
      },
      "aggs": {
        "DRUG_NAME_FACETS": {
          "terms": { "field": "DRUG.NAME", "size": 0 },
          "aggs": {
            "DISTINCT": { "cardinality": { "field": "DRUG.PATIENT" } }
          }
  }}}}
}

Sample result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "DRUG_FACETS": {
      "doc_count": 11,
      "DRUG_NAME_FACETS": {
        "buckets": [
          {
            "key": "aspirin",
            "doc_count": 6,
            "DISTINCT": {
              "value": 6
            }
          },
          {
            "key": "vitamin-b",
            "doc_count": 3,
            "DISTINCT": {
              "value": 2
            }
          },
          {
            "key": "vitamin-c",
            "doc_count": 2,
            "DISTINCT": {
              "value": 2
            }
          }
        ]
      }
    }
  }
}

The first one in the buckets would be aspirin. But you can see other 2 patients had also taken vitamin-b when they took aspirin.

If you change the field value of DRUG.NAME to another drug name for example "vitamin-b", I suppose you would get vitamin-b in the first position of the buckets.

Hopefully this is helpful to your question.