I am having trouble producing aggregated search results for documents with a multi-level hierarchy. A simplified document structure looks like this:
Magazine title (Hunting) -> Magazine year (1999) -> Magazine issue (II.) -> Pages (Text of pages ...)
Every level of the hierarchy is linked to its parent by the attribute "parentDocumentId".
I have prepared a simple query, which works just fine for a hierarchy with only 2 levels:
POST http://localhost:9200/my_index/document/_search?search_type=count&q=hunter
{
  "query": {
    "multi_match": {
      "query": "hunter",
      "fields": [ "title", "text", "labels" ]
    }
  },
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "parentDocumentId"
      }
    }
  }
}
This query searches the text of the pages and, instead of giving me thousands of pages containing the word "hunter", returns buckets of documents aggregated by parentDocumentId. However, these buckets represent only the "Magazine issues" that contain these pages.
Response:
{
  "took": 54,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 44,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_agg": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 5,
          "doc_count": 43
        },
        {
          "key": 0,
          "doc_count": 1
        }
      ]
    }
  }
}
What I need is to aggregate the search results at the highest possible level; in this particular case, that means at the "Magazine title" level. This could be done outside the Elasticsearch query (on our application side), but as I see it, it should be done in Elasticsearch (for performance and other reasons).
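To make the application-side alternative concrete, here is a minimal Python sketch of rolling the buckets from the response above up to their top-level ancestors. The `PARENT_OF` lookup and the document ids 1 and 2 are hypothetical (only bucket keys 5 and 0 come from the real response); in practice the lookup would have to be loaded from the index or from our source database, which is exactly the extra work I would like to avoid.

```python
from collections import Counter

# Hypothetical lookup: documentId -> parentDocumentId (0 marks a top-level document).
# Ids 1 and 2 are invented for illustration; 5 is the bucket key from the response.
PARENT_OF = {
    1: 0,  # magazine title ("Hunting")
    2: 1,  # magazine year (1999)
    5: 2,  # magazine issue (II.)
}


def root_of(document_id, parent_of):
    """Follow parentDocumentId links until reaching a document whose parent is 0.

    Ids missing from the lookup are treated as top-level documents.
    """
    seen = set()
    current = document_id
    while parent_of.get(current, 0) != 0:
        if current in seen:
            raise ValueError("cycle in parentDocumentId links")
        seen.add(current)
        current = parent_of[current]
    return current


def roll_up(buckets, parent_of):
    """Re-group terms-aggregation buckets by top-level ancestor."""
    totals = Counter()
    for bucket in buckets:
        key = bucket["key"]
        # A bucket with key 0 holds hits that are themselves top-level documents;
        # the aggregation does not say which ones, so they stay under key 0.
        root = 0 if key == 0 else root_of(key, parent_of)
        totals[root] += bucket["doc_count"]
    return dict(totals)


buckets = [{"key": 5, "doc_count": 43}, {"key": 0, "doc_count": 1}]
rolled = roll_up(buckets, PARENT_OF)
print(rolled)  # {1: 43, 0: 1}
```

Note that this needs one extra lookup per hierarchy level for every bucket, and the hits in the key 0 bucket cannot be attributed to a specific title without a second query.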
Does anybody have experience with a similar aggregation? Are Elasticsearch aggregations the right approach here?
Every idea is welcome.
Thanks Peter
Update: Our mapping looks like this:
{
  "my_index": {
    "mappings": {
      "document": {
        "properties": {
          "dateIssued": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "documentId": {
            "type": "long"
          },
          "filter": {
            "properties": {
              "geo_bounding_box": {
                "properties": {
                  "issuedLocation": {
                    "properties": {
                      "bottom_right": {
                        "properties": {
                          "lat": {
                            "type": "double"
                          },
                          "lon": {
                            "type": "double"
                          }
                        }
                      },
                      "top_left": {
                        "properties": {
                          "lat": {
                            "type": "double"
                          },
                          "lon": {
                            "type": "double"
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "issuedLocation": {
            "type": "geo_point"
          },
          "labels": {
            "type": "string"
          },
          "locationLinks": {
            "type": "geo_point"
          },
          "parentDocumentId": {
            "type": "long"
          },
          "query": {
            "properties": {
              "match_all": {
                "type": "object"
              }
            }
          },
          "storedLocation": {
            "type": "geo_point"
          },
          "text": {
            "type": "string"
          },
          "title": {
            "type": "string"
          },
          "type": {
            "type": "string"
          }
        }
      }
    }
  }
}
That means we use one mapping for all types of documents. We are indexing a set of books, newspapers and other press. Sometimes there is only one parent for a set of pages, and sometimes there are multiple levels of parents above the page level.
To distinguish the type of document, there is an attribute "type".
When indexing the upper levels (these mainly contain metadata, such as book metadata), we leave the "text" attribute empty and always specify the document's parent using parentDocumentId. Top-level documents have their parentDocumentId set to 0. When indexing the lowest level (pages), we provide only the "text" attribute and parentDocumentId.
The link is very similar to a classic one-to-many mapping (a magazine has many years, a year has many issues, an issue has many pages).
You could also say that we have flattened the nested documents in Elasticsearch. The reason is that there are multiple document types, and they can have hierarchies of different depths.
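To illustrate the indexing convention described above, here is a small Python sketch with made-up sample documents. The ids, titles, page texts and the values of "type" are assumptions for illustration only; the field names are the ones from our mapping. It shows that the number of levels above a page differs between a magazine and a book.

```python
# Upper levels: metadata only, "text" left empty; top-level documents have
# parentDocumentId 0. All ids, titles and "type" values here are invented.
upper_levels = [
    {"documentId": 1, "parentDocumentId": 0, "type": "magazine", "title": "Hunting", "text": ""},
    {"documentId": 2, "parentDocumentId": 1, "type": "year", "title": "1999", "text": ""},
    {"documentId": 5, "parentDocumentId": 2, "type": "issue", "title": "II.", "text": ""},
    {"documentId": 9, "parentDocumentId": 0, "type": "book", "title": "Some book", "text": ""},
]

# Lowest level: pages carry only the text and the link to their parent.
pages = [
    {"parentDocumentId": 5, "text": "... the hunter ..."},
    {"parentDocumentId": 9, "text": "... another page ..."},
]


def levels_above(page, upper_levels):
    """Count how many parent levels sit above a page."""
    parent_by_id = {d["documentId"]: d["parentDocumentId"] for d in upper_levels}
    count = 0
    current = page["parentDocumentId"]
    while current != 0:
        count += 1
        current = parent_by_id[current]
    return count


depths = [levels_above(p, upper_levels) for p in pages]
print(depths)  # [3, 1]: magazine page has 3 levels above it, book page has 1
```

Because the depth is not fixed, a terms aggregation on parentDocumentId alone reaches the top level only for documents like the book, where the direct parent is already the top-level document.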