Friends,
I am analyzing hundreds of millions of documents to find unique pairs. A mock example is shown below:
doc  field1  field2
1    AAA     BBB
2    AAA     CCC
3    PPP     QQQ
4    PPP     QQQ
5    XXX     YYY
6    XXX     YYY
7    MMM     NNN
About 90% of the documents contain a unique pair, as in docs 3, 4, 5, 6 and 7 above, and I am not interested in seeing those in my aggregation result. I am interested in aggregating docs 1 and 2, where the same field1 value occurs with different field2 values.
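To pin down the expected output, here is a minimal Python sketch that computes it on the mock data. It only illustrates the desired result, not the intended solution (the real filtering should happen inside Elasticsearch); the function name `keys_with_multiple_values` is made up for this example.

```python
from collections import defaultdict

# Mock documents from the example above, as (field1, field2) pairs.
docs = [
    ("AAA", "BBB"),
    ("AAA", "CCC"),
    ("PPP", "QQQ"),
    ("PPP", "QQQ"),
    ("XXX", "YYY"),
    ("XXX", "YYY"),
    ("MMM", "NNN"),
]


def keys_with_multiple_values(pairs):
    """Return the field1 keys that occur with more than one distinct field2 value."""
    seen = defaultdict(set)
    for f1, f2 in pairs:
        seen[f1].add(f2)
    # Keep only keys whose set of field2 values has more than one member.
    return {f1: sorted(values) for f1, values in seen.items() if len(values) > 1}


print(keys_with_multiple_values(docs))  # {'AAA': ['BBB', 'CCC']}
```

Only AAA survives, which is the bucket the aggregation should return.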
Terms Aggregation Query:
"aggs": { "f1": { "terms": { "field": "FIELD1", "min_doc_count": 2 }, "aggs": { "f2": { "terms": { "field": "FIELD2" } } } } }
Terms Aggregation Result:
"aggregations": { "f1": { "buckets": [ { "key": "PPP", "doc_count": 2, "f2": { "buckets": [ { "key": "QQQ", "doc_count": 2 } ] } }, { "key": "XXX", "doc_count": 2, "f2": { "buckets": [ { "key": "YYY", "doc_count": 2 } ] } }, { "key": "AAA", "doc_count": 2, "f2": { "buckets": [ { "key": "BBB", "doc_count": 1 }, { "key": "CCC", "doc_count": 1 } ] } } ] } }
I am interested only in key AAA appearing in the aggregation result. What is the best way to filter the aggregation result down to the buckets containing distinct pairs?
I tried a cardinality aggregation, which returns the unique value count. However, I am not able to filter the buckets I am not interested in out of the aggregation results.
Cardinality Aggregation Query:
"aggs": { "f1": { "terms": { "field": "FIELD1", "min_doc_count": 2 }, "aggs": { "f2": { "cardinality": { "field": "FIELD2" } } } } }
Cardinality Aggregation Result:
"aggregations": { "f1": { "buckets": [ { "key": "PPP", "doc_count": 2, "f2": { "value" : 1 } }, { "key": "XXX", "doc_count": 2, "f2": { "value" : 1 } }, { "key": "AAA", "doc_count": 2, "f2": { "value" : 2 } } ] } }
If I could at least sort by the cardinality value, that would help me find some workarounds. Please help me in this regard.
P.S.: Writing a Spark/MapReduce program to post-process/filter the aggregation result is not the expected solution for this issue.
Use "order" in the first aggregation, like this: "terms": {"field": "field1", "min_doc_count": 2, "order": {"f2": "desc"}} (or "asc" for ascending order).
– Vara Prasad
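
Building on the "order" suggestion, the following Python sketch assembles a request body that sorts the f1 buckets by the f2 cardinality and, as an additional assumption not taken from the comment, drops buckets whose cardinality is 1 with a bucket_selector pipeline aggregation. The helper name `build_request_body`, the sub-aggregation name `only_multi`, and the `size` default are hypothetical. The script syntax `params.distinct > 1` assumes a Painless-era Elasticsearch version; older versions may need a different script form.

```python
import json


def build_request_body(field1="FIELD1", field2="FIELD2", size=100):
    """Build a terms + cardinality request that keeps only multi-valued keys."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "f1": {
                "terms": {
                    "field": field1,
                    "min_doc_count": 2,
                    "size": size,
                    # Sort buckets by the cardinality sub-aggregation.
                    "order": {"f2": "desc"},
                },
                "aggs": {
                    # Approximate count of distinct field2 values per bucket.
                    "f2": {"cardinality": {"field": field2}},
                    # Assumed addition: keep buckets with more than one value.
                    "only_multi": {
                        "bucket_selector": {
                            "buckets_path": {"distinct": "f2"},
                            "script": "params.distinct > 1",
                        }
                    },
                },
            }
        },
    }


print(json.dumps(build_request_body(), indent=2))
```

Two caveats apply. The bucket_selector only filters the buckets that the terms aggregation already returned, so it sees at most `size` keys, which is why sorting by f2 in descending order matters here. The cardinality aggregation is also approximate, so counts close to the threshold may be off on very large data sets.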