Elasticsearch - aggregating multi level hierarchy

Question

I am having trouble producing aggregated search results for documents with a multi-level hierarchy. The simplified document structure looks like this:

Magazine title (Hunting) -> Magazine year (1999) -> Magazine issue (II.) -> Pages (Text of pages ...)

Every level of the document is mapped to its parent by the attribute "parentDocumentId".

I have prepared a simple query, which works just fine for a hierarchy with only 2 levels:

POST http://localhost:9200/my_index/document/_search?search_type=count&q=hunter
{
  "query": {
    "multi_match": {
      "query": "hunter",
      "fields": [ "title", "text", "labels" ]
    }
  },
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "parentDocumentId"
      }
    }
  }
}

This query searches through the text of the pages and, instead of giving me thousands of pages containing the word "hunter", returns buckets of documents aggregated by parentDocumentId. However, these buckets represent only the "Magazine issues" which contain these pages.

Response:

{
   "took": 54,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 44,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "my_agg": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": 5,
               "doc_count": 43
            },
            {
               "key": 0,
               "doc_count": 1
            }
         ]
      }
   }
}

What I need is to aggregate search results at the highest possible level. In this particular case, that means aggregating at the "Magazine title" level. This could be done outside the Elasticsearch query (on our application side), but as I see it, it should really be done in Elasticsearch (for performance and other reasons).

Does anybody have experience with a similar aggregation? Are Elasticsearch aggregations the right approach to use?

Every idea is welcome.

Thanks Peter

Update: Our mapping looks like this:

{
   "my_index": {
      "mappings": {
         "document": {
            "properties": {
               "dateIssued": {
                  "type": "date",
                  "format": "dateOptionalTime"
               },
               "documentId": {
                  "type": "long"
               },
               "filter": {
                  "properties": {
                     "geo_bounding_box": {
                        "properties": {
                           "issuedLocation": {
                              "properties": {
                                 "bottom_right": {
                                    "properties": {
                                       "lat": {
                                          "type": "double"
                                       },
                                       "lon": {
                                          "type": "double"
                                       }
                                    }
                                 },
                                 "top_left": {
                                    "properties": {
                                       "lat": {
                                          "type": "double"
                                       },
                                       "lon": {
                                          "type": "double"
                                       }
                                    }
                                 }
                              }
                           }
                        }
                     }
                  }
               },
               "issuedLocation": {
                  "type": "geo_point"
               },
               "labels": {
                  "type": "string"
               },
               "locationLinks": {
                  "type": "geo_point"
               },
               "parentDocumentId": {
                  "type": "long"
               },
               "query": {
                  "properties": {
                     "match_all": {
                        "type": "object"
                     }
                  }
               },
               "storedLocation": {
                  "type": "geo_point"
               },
               "text": {
                  "type": "string"
               },
               "title": {
                  "type": "string"
               },
               "type": {
                  "type": "string"
               }
            }
         }
      }
   }
}

That means we use one mapping for all types of documents. We are indexing a set of books, newspapers and other press. Sometimes there is only one parent for a set of pages, and sometimes there are multiple levels of parents above the page level.

To distinguish the type of document there is an attribute "type".

When indexing the top levels (these mostly contain book metadata), we leave the "text" attribute empty and always specify the parent of the document using parentDocumentId. The top-level documents have their parentDocumentId set to 0. When indexing the lowest level (pages), we provide only the text attribute and parentDocumentId for the indexed document.

The link used is very similar to a classic one-to-many mapping (a magazine has many years, a year has many issues, an issue has many pages).

You could also say that we have flattened the nested documents in Elasticsearch. The reason is that there are multiple document types, which can have hierarchies of different depths.

Could you post a specific example of a document with the full hierarchy? (not necessarily with all the properties). It would be helpful if you also included the mapping. As it stands, it's totally unclear how you index your docs. Is it one nested document that is flattened in Elasticsearch? Is it a nested doc with the nested type? Is it one doc per hierarchy level that just reference each other with a PK/FK pairs as in relational databases? — Jakub Kotowski
@jkbkot - thank you very much for your comment. I have updated the mapping description and also described how the documents are indexed. — shimon001

Jakub Kotowski · Accepted Answer · 2015-05-04T12:24:18

You need to rethink your data modelling. In essence, you need a join over your data, and moreover the join needs to be over an arbitrarily deep hierarchy. That is a problem even in relational databases, let alone in a full-text search engine like Elasticsearch.

Elasticsearch does support a couple of joins. You could use nested documents - a single document with all the subdocs nested. That's clearly not ideal in your case.

You could use the parent-child relationship feature, which lets you index your (sub-)docs separately, each referring to its parent. Underneath, that join is resolved at query time (Lucene's block join is what nested documents rely on, not parent-child). However, to aggregate over a hierarchy, you would have to specify the join explicitly, listing all the intermediate steps. You want to always aggregate by the top-most available doc, but that could be a different level each time (once a magazine, another time a magazine collection or perhaps a publisher).

I would consider indexing each doc with a field pointing to the top-most document. Then you can easily aggregate by that field. It means precomputing part of the complex aggregation you want to do, but it results in fast aggregations, and updates wouldn't be very painful either. It all depends on the source of your data, how you expect it to change, and what updates and other queries you'll need to run.
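
As a rough illustration of that idea, here is a minimal Python sketch of the precomputation step, run before indexing. It assumes every document carries both documentId and parentDocumentId and that top-level documents have parentDocumentId 0, as described in the question. The field name rootDocumentId, the helper names and the aggregation name by_root are hypothetical and not part of the original mapping.

```python
from typing import Dict, List


def top_most_ancestor(doc_id: int, parent_of: Dict[int, int]) -> int:
    """Follow parentDocumentId links until reaching a document whose parent is 0."""
    seen = set()
    current = doc_id
    while parent_of[current] != 0:
        if current in seen:
            raise ValueError(f"cycle in hierarchy at document {current}")
        seen.add(current)
        current = parent_of[current]
    return current


def add_root_field(docs: List[dict]) -> List[dict]:
    """Return copies of the docs with a hypothetical rootDocumentId field added."""
    parent_of = {d["documentId"]: d["parentDocumentId"] for d in docs}
    return [
        dict(d, rootDocumentId=top_most_ancestor(d["documentId"], parent_of))
        for d in docs
    ]


def build_search_body(text: str) -> dict:
    """Same query as in the question, but bucketed by the precomputed root field."""
    return {
        "query": {
            "multi_match": {"query": text, "fields": ["title", "text", "labels"]}
        },
        "aggregations": {
            "by_root": {"terms": {"field": "rootDocumentId"}}
        },
    }


if __name__ == "__main__":
    # Made-up sample hierarchy: magazine -> year -> issue -> page, plus a book with one page.
    sample = [
        {"documentId": 1, "parentDocumentId": 0, "type": "magazine"},
        {"documentId": 2, "parentDocumentId": 1, "type": "year"},
        {"documentId": 3, "parentDocumentId": 2, "type": "issue"},
        {"documentId": 4, "parentDocumentId": 3, "type": "page"},
        {"documentId": 10, "parentDocumentId": 0, "type": "book"},
        {"documentId": 11, "parentDocumentId": 10, "type": "page"},
    ]
    for doc in add_root_field(sample):
        print(doc["documentId"], "->", doc["rootDocumentId"])
```

In this sketch a top-level document gets its own id as rootDocumentId, so hits on the magazine record itself land in the same bucket as hits on its pages. The trade-off is that moving a sub-tree under a different parent requires reindexing every document in that sub-tree with the new root.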

This blog post could help to guide you a bit too: https://www.elastic.co/blog/managing-relations-inside-elasticsearch
