In short: with Elasticsearch, given a list of fields, how can I get the average number of missing fields per document as an aggregation?

Details

With the missing aggregation type I can get the total number of documents where a given field is missing. So with the following data:

"hits": [{
    "name": "A name",
    "nickname": "A nickname",
    "bestfriend": "A friend",
    "hobby": "An hobby"
},{
    "name": "A name",
    "hobby": "An hobby"
},{
    "name": "A name",
    "nickname": "A nickname",
    "hobby": "An hobby"
},{
    "name": "A name",
    "bestfriend": "A friend"
}]

I can run the following query:

{
    "aggs": {
        "name_missing": {
            "missing": {"field": "name"}
        },
        "nickname_missing": {
            "missing": {"field": "nickname"}
        },
        "hobby_missing": {
            "missing": {"field": "hobby"}
        },
        "bestfriend_missing": {
            "missing": {"field": "bestfriend"}
        }
    }
}

And I get the following aggregations:

...
"aggregations": {
    "name_missing": {
        "doc_count": 0
    },
    "nickname_missing": {
        "doc_count": 2
    },
    "hobby_missing": {
        "doc_count": 1
    },
    "bestfriend_missing": {
        "doc_count": 1
    }   
}
...

What I need now is the average number of missing fields per document. I can do the math in code on the results:

  • sum the doc_count values of all the missing aggregations
  • divide by the total number of hits
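
For reference, the client-side calculation I mean is roughly the following Python sketch. The helper name `average_missing_fields` is only illustrative, and the counts are copied from the aggregation response above:

```python
# Aggregations copied from the response shown above.
aggregations = {
    "name_missing": {"doc_count": 0},
    "nickname_missing": {"doc_count": 2},
    "hobby_missing": {"doc_count": 1},
    "bestfriend_missing": {"doc_count": 1},
}
total_hits = 4  # number of documents in the sample data


def average_missing_fields(aggregations, total_hits):
    """Sum the doc_count of every missing aggregation, then divide by the hit count."""
    if total_hits == 0:
        return 0.0  # avoid dividing by zero on an empty result set
    total_missing = sum(agg["doc_count"] for agg in aggregations.values())
    return total_missing / total_hits


print(average_missing_fields(aggregations, total_hits))
```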

But how would you get the same result as an aggregation from Elasticsearch?

Thank you for any reply / suggestion.

Share your ES query. - Hatim Stovewala
@HatimStovewala question has been updated. Thank you! - Francesco Abeni

1 Answer

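This is an ugly solution, but it does the trick. bucket_script only works inside a multi-bucket aggregation, so the terms aggregation with a constant script is there just to put every document into a single bucket.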

GET missing/missing/_search
{
  "size": 0,
  "aggs": {
    "result": {
      "terms": {
        "script": "'aaa'"
      },
      "aggs": {
        "name_missing": {
          "missing": {
            "field": "name"
          }
        },
        "nickname_missing": {
          "missing": {
            "field": "nickname"
          }
        },
        "hobby_missing": {
          "missing": {
            "field": "hobby"
          }
        },
        "bestfriend_missing": {
          "missing": {
            "field": "bestfriend"
          }
        },
        "avg_missing": {
          "bucket_script": {
            // buckets_path defines the variables available to the script.
            // "<name>_missing._count" is the doc_count of that missing aggregation;
            // "_count" is the doc_count of the enclosing bucket (the total number of hits).
            "buckets_path": {
              "name_missing": "name_missing._count",
              "nickname_missing": "nickname_missing._count",
              "hobby_missing": "hobby_missing._count",
              "bestfriend_missing": "bestfriend_missing._count",
              "count":"_count"
            },
            // Add up all the missing counts and divide by the total number of hits.
            "script": "(name_missing+nickname_missing+hobby_missing+bestfriend_missing)/count"
          }
        }
      }
    }
  }
}

That shows the approach; how you adapt the parameters and extract the value you need is up to you.
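
Since you start from a list of fields, you could also generate the request body instead of writing it by hand. Here is a minimal Python sketch, not tied to any particular client library. The helper names `build_avg_missing_request` and `extract_avg_missing` are made up for illustration, it assumes the field names are valid as script variable names (no dots, for example), and it assumes the response has the usual terms-bucket layout with the bucket_script result under `value`:

```python
def build_avg_missing_request(fields):
    """Build the search body: one missing agg per field plus the bucket_script."""
    sub_aggs = {}
    buckets_path = {"count": "_count"}  # doc_count of the enclosing bucket
    for field in fields:
        agg_name = f"{field}_missing"
        sub_aggs[agg_name] = {"missing": {"field": field}}
        buckets_path[agg_name] = f"{agg_name}._count"

    # Same script as in the query above, generated from the field list.
    total = "+".join(f"{field}_missing" for field in fields)
    sub_aggs["avg_missing"] = {
        "bucket_script": {
            "buckets_path": buckets_path,
            "script": f"({total})/count",
        }
    }
    return {
        "size": 0,
        "aggs": {"result": {"terms": {"script": "'aaa'"}, "aggs": sub_aggs}},
    }


def extract_avg_missing(response):
    """Read avg_missing from the single constant-key bucket, if there is one."""
    buckets = response["aggregations"]["result"]["buckets"]
    return buckets[0]["avg_missing"]["value"] if buckets else None


body = build_avg_missing_request(["name", "nickname", "hobby", "bestfriend"])
print(body["aggs"]["result"]["aggs"]["avg_missing"]["bucket_script"]["script"])
```

Send `body` with whatever client you use and pass the parsed JSON response to `extract_avg_missing`.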