Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows -:
Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.
I have a function substring(x, y)
which returns booleans depicting whether x is a substring in y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches -:
- Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B) - this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n) - or whatever portion of it is included in said substring. If not, proceed to apply (A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.
Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the String at that level/index, and subsequently apply substring(). It would take atleast two loops, and O(size(B) * maxlength(x1, x2,...))
worst case time, or more depending on substring() (correct me if wrong).
I thought of a second approach based on suffix trees/arrays.
- Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in
O(maxlength(y1, y2,...)
(?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.
If there is a better approach, I'd love to know.
EDIT: Apologies if it seemed like I abandoned the topic.
What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time as possible.
Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?
It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?