14
votes

The longest common substring problem according to wiki can be solved using a suffix tree.
From wiki:

The longest common substrings of a set of strings can be found by building a generalised suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it

I don't get this.
Example: if I have:
ABCDE and XABCZ
then the suffix tree is (some branches from XABCZ omitted due to space):
enter image description here

The longest common substring is ABC but it is not I can not see how the description of wiki helps here.
ABC is not the deepest internal nodes with leaf nodes.
Any help to understand how this works?

2
ABC is not the deepest internal nodes with leaf nodes. No, but ABC is the longest common string of nodes anywhere in the tree. The next longest ones are B-C and D-E, with two nodes each. - Robert Harvey
Yes ABC is the longest common string. But I don't understand how the wiki description would actually help me find it programmatically - Cratylus
You have to read the other Wiki: en.wikipedia.org/wiki/Generalised_suffix_tree. There are probably some better (more easily understandable) resources here. See also stackoverflow.com/questions/969448/… - Robert Harvey
@user384706 - I think part of the problem is that's not a proper suffix tree. You should only one branch that starts root-A-B-C, and the C would have two children: D-E and Z. Similar for the rest of the branches. What you have there is basically just a list of suffixes that all have a root node pointing at them. - twalberg
@twalberg:Yes, you are right. The 2 blue branches would be one. But in this case how could I programmatically find them? It is not clear to me how the wiki description helps - Cratylus

2 Answers

8
votes

Like what's been said before, your tree is incorrect.

This is what I get when running "ABCDE$XABCZ" through my code.

Suffix Tree code:

String = ABCDE$XABCZ$
End of word character 1 = $
└── (0)
    ├── (20) $
    ├── (22) ABC
    │   ├── (15) DE$
    │   └── (23) Z$
    ├── (24) BC
    │   ├── (16) DE$
    │   └── (25) Z$
    ├── (26) C
    │   ├── (17) DE$
    │   └── (27) Z$
    ├── (18) DE$
    ├── (19) E$
    ├── (21) XABCZ$
    └── (28) Z$

In a (compact) suffix tree, you need to find the deepest internal node(s) which have leaf nodes from all the strings. If you have multiple nodes at the same depth, you have to compare the length of the string represented by that node. i.e. ABC, BC, and C all have the same depth, so you have to compare the length of ABC, BC, and C strings to see which is longer; which is ABC obviously.

Suffix Trie code:

└── null
    ├── A
    │   └── B
    │       └── C
    │           ├── D
    │           │   └── (E) ABCDE
    │           └── (Z) ABCZ
    ├── B
    │   └── C
    │       ├── D
    │       │   └── (E) BCDE
    │       └── (Z) BCZ
    ├── C
    │   ├── D
    │   │   └── (E) CDE
    │   └── (Z) CZ
    ├── D
    │   └── (E) DE
    ├── (E) E
    ├── X
    │   └── A
    │       └── B
    │           └── C
    │               └── (Z) XABCZ
    └── (Z) Z

In a (not compact) suffix trie, find the deepest internal node(s) which have leaf nodes from all the strings.

Hope it helps.

4
votes

You have not actually drawn the suffix tree. Had you done it properly, at the root you would only have every possible character once. The tree only splits when a single letter can have multiple following suffixes. That forces common substrings together in the tree, which makes them findable.