21
votes

I have a potentially infinite set of symbols: A, B, C, ... There is also a distinct special placeholder symbol ? (its meaning will be explained below).

Consider non-empty finite trees such that every node has a symbol attached to it and 0 or more non-empty sub-trees. The order of sub-trees of a given node is significant (so, for example, if there is a node with 2 sub-trees, we can distinguish which one is left and which one is right). Any given symbol can appear in the tree 0 of more times attached to different nodes. The placeholder symbol ? can be attached only to leaf nodes (i.e. nodes having no sub-trees). It follows from the usual definition of a tree that trees are acyclic.

The finiteness requirement means that the total number of nodes in a tree is a positive finite integer. It follows that the total number of attached symbols, the tree depth and the total number of nodes in every sub-tree are all finite.

Trees are given in a functional notation: a node is represented by a symbol attached to it and, if there are any sub-trees, it is followed by parentheses containing comma-separated list of sub-trees represented recursively in the same way. So, for example the tree

                    A
                   / \
                  ?   B
                     / \
                    A   C
                   /|\
                  A C Q
                       \
                        ?

is represented as A(?,B(A(A,C,Q(?)),C)).

I have a pre-calculated unchanging set of trees S that will be used as patterns to match. The set will typically have ~ 105 trees, and every its element will typically have ~ 10-30 nodes. I can use a plenty of time to create beforehand any representation of S that will best suit my problem stated below.

I need to write a function that accepts a tree T (typically with ~ 102 nodes) and checks as fast as possible if T contains as a subtree any element of S, provided that any node with placeholder symbol ? matches any non-empty subtree (both when it appears in T or in an element of S).

Please suggest a data structure to store the set S and an algorithm to check for a match. Any programing language or a pseudo-code is OK.

2
Try researching 'regular tree grammars' and tree automata. - Antimony
@tmyklebu A(?) considered as a pattern means a node with the symbol A attached that has exactly one sub-tree. So, it does not match A(B,C) or A, but it would match A(B) or A(B(C,D)). - TauMu
@jwpat7 I mentioned left and right sub-trees in the case when there is a node with two sub-trees. If a node has only one sub-tree, it does not matter how the sub-tree is positioned on the visual diagram. There are no empty sub-trees — so, no such things as A(,B). The notation Q(?) denotes the node Q with the single sub-tree consisting of the single node with the placeholder symbol ?. If considered as a pattern, it matches any tree with the root node Q and exactly one non-empty sub-tree of any form, for example, Q(A) or Q(A(B,C(D)))... - TauMu
@jwpat7 ...But Q(?) does not match Q (0 sub-trees), Q(A,B) (2 sub-trees) or A(Q) (wrong symbol at the root). It matches ? though, because the placeholder ? stands for any tree, including those placeholder-free trees that could be matched by Q(?). So, given a mere possibility of a match, we consider the match successful. It could be also perceived that because ? is a more general (more inclusive) pattern than Q(?), the match is effectively performed in the other direction (i.e. ? matches any tree including Q(?)). Sorry, I did not explained this last point in my question. - TauMu
@jwpat7 ...I think, a more clear explanation would be that in case when both a 'pattern' tree (an element of S) and a 'data' tree T contain placeholders ?, the match is considered successful iff there exist a placeholder-free tree that could be successfully matched by both 'pattern' and 'data' trees (in other words, when the patterns overlap). - TauMu

2 Answers

6
votes

This paper describes a variant of the Aho–Corasick algorithm, where instead of using a finite state machine (which the standard Aho–Corasick algorithm uses for string matching) the algorithm instead uses a pushdown automaton for subtree matching. Like the Aho-Corasick string-matching algorithm, their variant only requires one pass through the input tree to match against the entire dictionary of S.

The paper is quite complex - it may be worth it to contact the author to see if he has any source code available.

4
votes

What you need is a finite state machine that tracks the set of potential matches you might have.

In essence, such a machine is is the result of matching the patterns against each other, and determining what part of the individual matches they share. This is analogous to how lexers take sets of regular expressions for tokens and compose them into a large FSA that can match any of the regular expressions by processing characters one at a time.

You can find references to methods for doing this under term rewriting systems.