
I'm currently evaluating whether to use Elasticsearch or Solr in a project and am working through the cases that need to be implemented. I found one case for which I couldn't find any documentation, which felt a bit strange since the case seems quite common. The categories are user-supplied, so I don't know them in advance. Consider the following part of a taxonomy, in which documents can have multiple categories:

  • Root (3)
    • Books (2)
      • Sci-fi (1)
        • DocumentA
      • Fantasy (2)
        • DocumentA
        • DocumentC
    • Movies (1)
      • Action (1)
        • DocumentB
    • Games (1)
      • Adventure (1)
        • DocumentB

In this case DocumentB could be an entry for, e.g., Indiana Jones. Normal term hierarchies can be implemented using the path hierarchy tokenizer in Solr/Elasticsearch, so DocumentC would have 'Root/Books/Fantasy' as its category, with the path split on '/'.
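To make the tokenization concrete, here is a minimal plain-Python sketch of what a path hierarchy tokenizer emits for such a path. It only emulates the behavior; the function name `path_hierarchy_tokens` is made up for illustration and is not part of Solr or Elasticsearch.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emulate path hierarchy tokenization: emit every ancestor prefix of the path."""
    parts = path.split(delimiter)
    # Build 'Root', 'Root/Books', 'Root/Books/Fantasy', ... one level at a time.
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


print(path_hierarchy_tokens("Root/Books/Fantasy"))
# ['Root', 'Root/Books', 'Root/Books/Fantasy']
```

Because every ancestor becomes its own token, a search or facet on 'Root/Books' matches DocumentC without any extra fields.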

DocumentB, however, would need two paths ('Root/Movies/Action' and 'Root/Games/Adventure'). I thought about dynamically adding one category_n field per path to the document in Elasticsearch, each using the path hierarchy tokenizer, and then running the category search across all category_* fields. I don't know whether that is the right approach, though, especially since the document counts for the facets are not straightforward: the count of a parent node is not the sum of its children's counts, because a document can be in multiple child categories and should not be counted more than once.
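The counting concern can be illustrated with the Books branch of the taxonomy above. This is a small plain-Python sketch with no search engine involved; the variable names are only for illustration.

```python
# Documents per child category of 'Root/Books', taken from the taxonomy above.
books_children = {
    "Sci-fi": {"DocumentA"},
    "Fantasy": {"DocumentA", "DocumentC"},
}

# Naive parent count: add up the child counts. DocumentA is counted twice.
naive_count = sum(len(docs) for docs in books_children.values())

# Desired parent count: number of distinct documents anywhere below the parent.
distinct_count = len(set().union(*books_children.values()))

print(naive_count, distinct_count)
# 3 2
```

The desired facet value for Books is 2, matching the tree above, so whatever the engine does has to count distinct documents per node rather than sum the child buckets.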

What would be a good way to implement this in Solr/Elasticsearch?

Cheers

1 Answer

I ended up using ES with a single category field into which I put every path to the node, so 'Root/Movies/Action' and 'Root/Games/Adventure'. I then used a path hierarchy tokenizer splitting on '/' for this field. ES supports putting multiple paths in that field and searching them. Finally, I used a bucket aggregation on the categories, which yielded exactly what I wanted: documents are not counted multiple times if they occur more than once in a branch.
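For reference, the setup described above could look roughly like the following request bodies, written here as Python dicts. This is a sketch under assumptions, not the exact configuration I used: the names `category_path`, `category` and `name` are illustrative, the typeless mapping assumes Elasticsearch 7.x or later, and `fielddata` is enabled because recent versions do not allow a terms aggregation on an analyzed `text` field otherwise.

```python
import json

# Index settings and mapping: a custom analyzer built on the path_hierarchy
# tokenizer, applied to one multi-valued "category" field (assumed names).
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "category_path": {"type": "path_hierarchy", "delimiter": "/"}
            },
            "analyzer": {
                "category_path": {"type": "custom", "tokenizer": "category_path"}
            },
        }
    },
    "mappings": {
        "properties": {
            "category": {
                "type": "text",
                "analyzer": "category_path",
                # Assumption: needed on recent versions to aggregate on a text field.
                "fielddata": True,
            }
        }
    },
}

# A document in two branches simply lists both full paths in the same field.
doc_b = {
    "name": "DocumentB",  # hypothetical field, for illustration only
    "category": ["Root/Movies/Action", "Root/Games/Adventure"],
}

# Terms aggregation: one bucket per path prefix, no hits returned.
search_body = {
    "size": 0,
    "aggs": {"categories": {"terms": {"field": "category", "size": 100}}},
}

print(json.dumps(search_body, indent=2))
```

Each bucket's `doc_count` is a count of documents, so DocumentB contributes once to the 'Root' bucket even though both of its paths produce the 'Root' token.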