I have a potentially infinite set of symbols: A, B, C, ... There is also a distinct special placeholder symbol ? (its meaning will be explained below).
Consider non-empty finite trees such that every node has a symbol attached to it and 0 or more non-empty sub-trees. The order of sub-trees of a given node is significant (so, for example, if there is a node with 2 sub-trees, we can distinguish which one is left and which one is right). Any given symbol can appear in the tree 0 of more times attached to different nodes. The placeholder symbol ? can be attached only to leaf nodes (i.e. nodes having no sub-trees). It follows from the usual definition of a tree that trees are acyclic.
The finiteness requirement means that the total number of nodes in a tree is a positive finite integer. It follows that the total number of attached symbols, the tree depth and the total number of nodes in every sub-tree are all finite.
Trees are given in a functional notation: a node is represented by a symbol attached to it and, if there are any sub-trees, it is followed by parentheses containing comma-separated list of sub-trees represented recursively in the same way. So, for example the tree
A
/ \
? B
/ \
A C
/|\
A C Q
\
?
is represented as A(?,B(A(A,C,Q(?)),C)).
I have a pre-calculated unchanging set of trees S that will be used as patterns to match. The set will typically have ~ 105 trees, and every its element will typically have ~ 10-30 nodes. I can use a plenty of time to create beforehand any representation of S that will best suit my problem stated below.
I need to write a function that accepts a tree T (typically with ~ 102 nodes) and checks as fast as possible if T contains as a subtree any element of S, provided that any node with placeholder symbol ? matches any non-empty subtree (both when it appears in T or in an element of S).
Please suggest a data structure to store the set S and an algorithm to check for a match. Any programing language or a pseudo-code is OK.
A(?)considered as a pattern means a node with the symbolAattached that has exactly one sub-tree. So, it does not matchA(B,C)orA, but it would matchA(B)orA(B(C,D)). - TauMuA(,B). The notationQ(?)denotes the nodeQwith the single sub-tree consisting of the single node with the placeholder symbol?. If considered as a pattern, it matches any tree with the root nodeQand exactly one non-empty sub-tree of any form, for example,Q(A)orQ(A(B,C(D)))... - TauMuQ(?)does not matchQ(0 sub-trees),Q(A,B)(2 sub-trees) orA(Q)(wrong symbol at the root). It matches?though, because the placeholder?stands for any tree, including those placeholder-free trees that could be matched byQ(?). So, given a mere possibility of a match, we consider the match successful. It could be also perceived that because?is a more general (more inclusive) pattern thanQ(?), the match is effectively performed in the other direction (i.e.?matches any tree includingQ(?)). Sorry, I did not explained this last point in my question. - TauMu?, the match is considered successful iff there exist a placeholder-free tree that could be successfully matched by both 'pattern' and 'data' trees (in other words, when the patterns overlap). - TauMu