Solr Hierarchy Numbering

Question

The example on Solr's wiki page shows hierarchy nodes indexed as depth-prefixed terms:

Doc#1: 0/NonFic, 1/NonFic/Law
Doc#2: 0/NonFic, 1/NonFic/Sci
Doc#3: 0/NonFic, 1/NonFic/Hist

How do I index my paths to achieve this? Do I have to split my paths manually, count the nodes, generate these terms myself, and store them as an array in a multiValued field, or can Solr's path hierarchy tokenizer be configured to add the depth prefixes itself?

For reference, I thought about generating the paths like this:

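import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.StringJoiner;
import java.util.stream.Collectors;

// ESearchDocumentPath is my own class; HIERARCHY_SEPERATOR is ">" in the example below.
public class DocumentPathBuilder {

    private final List<String> nodes = new ArrayList<>();

    public static DocumentPathBuilder newInstance() {
        return new DocumentPathBuilder();
    }

    // Removes the separator from a node name, then upper-cases and trims it.
    public static String escapeText(String input) {
        if (input == null)
            throw new NullPointerException("Cannot escape null input!");
        // replace() treats the separator literally; replaceAll() would parse it as a regex.
        return input.replace(ESearchDocumentPath.HIERARCHY_SEPERATOR, "").toUpperCase().trim();
    }

    public DocumentPathBuilder add(String node) {
        nodes.add(escapeText(node));
        return this;
    }

    public DocumentPathBuilder add(Collection<String> nodes) {
        this.nodes.addAll(nodes.stream()
                .map(DocumentPathBuilder::escapeText)
                .collect(Collectors.toList())
        );
        return this;
    }

    // Emits one term per depth: "<depth><sep><node 0><sep>...<node depth><sep>".
    public List<String> build() {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            StringJoiner joiner = new StringJoiner(ESearchDocumentPath.HIERARCHY_SEPERATOR);
            joiner.add(String.valueOf(i));
            for (int j = 0; j <= i; j++) {
                joiner.add(nodes.get(j));
            }
            result.add(joiner.toString() + ESearchDocumentPath.HIERARCHY_SEPERATOR);
        }
        return result;
    }
}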

Example input:

List<String> build = DocumentPathBuilder.newInstance()
        .add("A")
        .add("350")
        .add(Arrays.asList("350-01", "FIGUTZRg"))
        .build();

Output entries:

0 = "0>A>"
1 = "1>A>350>"
2 = "2>A>350>350-01>"
3 = "3>A>350>350-01>FIGUTZRG>"

Also, what is the difference? If I store my generated values in a multiValued field, do I get the same result as if Solr had generated them with the path hierarchy tokenizer?

MatsLindh · Accepted Answer · 2017-11-29T10:01:06

From the page you're referencing:

You must perform some index time processing on this flattened data in order to create the tokens needed for a facet.prefix approach. When we index the data we create specially formatted terms that encode the depth information for each node that appears as part of the path, and include the hierarchy separated by a common separator (“depth/first level term/second level term/etc”). We also add additional terms for every ancestors in the original data.

So this is not built into the Path Hierarchy Tokenizer; you have to generate those terms yourself at index time. The tokenizer's documentation example also shows what its resulting tokens look like (note that no depth value is present):

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>

In: "c:\usr\local\apache"

Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
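To make the difference concrete, here is a minimal plain-Java sketch that puts the two outputs side by side. It does not call any Solr or Lucene API: pathPrefixes only imitates the token list shown above for simple paths (no leading or trailing delimiter), and depthPrefixedTerms builds the depth-prefixed terms from the wiki example. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr: compares the two term shapes.
final class HierarchyTerms {

    // Imitates the tokenizer output shown above: one token per path prefix,
    // with the delimiter swapped for the replacement character.
    static List<String> pathPrefixes(String path, char delimiter, char replacement) {
        List<String> tokens = new ArrayList<>();
        String normalized = path.replace(delimiter, replacement);
        int from = 0;
        while (true) {
            int idx = normalized.indexOf(replacement, from);
            if (idx < 0) {
                tokens.add(normalized); // the full path is the last token
                break;
            }
            tokens.add(normalized.substring(0, idx));
            from = idx + 1;
        }
        return tokens;
    }

    // Builds the wiki-style terms: each path prefix gets its depth in front.
    static List<String> depthPrefixedTerms(String path, char separator) {
        List<String> terms = new ArrayList<>();
        List<String> prefixes = pathPrefixes(path, separator, separator);
        for (int depth = 0; depth < prefixes.size(); depth++) {
            terms.add(depth + String.valueOf(separator) + prefixes.get(depth));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [c:, c:/usr, c:/usr/local, c:/usr/local/apache]
        System.out.println(pathPrefixes("c:\\usr\\local\\apache", '\\', '/'));
        // prints [0/NonFic, 1/NonFic/Law]
        System.out.println(depthPrefixedTerms("NonFic/Law", '/'));
    }
}
```

The only difference between the two lists is the leading depth number, and that number is what the facet.prefix approach from the wiki page relies on. Terms generated this way (or with your builder) would go into a multiValued field as they are.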
