Longest common substring in two sequences of strings

Question

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows -:

Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.

I have a function substring(x, y) which returns booleans depicting whether x is a substring in y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches -:

Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B) - this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n) - or whatever portion of it is included in said substring. If not, proceed to apply (A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.

Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the String at that level/index, and subsequently apply substring(). It would take atleast two loops, and O(size(B) * maxlength(x1, x2,...)) worst case time, or more depending on substring() (correct me if wrong).

I thought of a second approach based on suffix trees/arrays.

Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2,...)(?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.

If there is a better approach, I'd love to know.

EDIT: Apologies if it seemed like I abandoned the topic.

What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time as possible.

Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?

It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?

You want a string that is a substring of all the strings in Y. But B is a union of all the strings (not intersection). Hence you would be finding the string that is a substring of atleast 1 of the strings in Y. Am I missing something? — Abhishek Bansal
A substring of the contatenation of m strings is not necessarily a substring of one of those m strings. The following statement is therefor false: "Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B." — Kris Vandermotten
Y is a sequence of strings. Shouldn't I be looking for the union/concatenation of all strings then? That's what I'm thinking when I read "all strings in Y". I'm probably messing this up. Could you explain with an example please? — PritishC

Vishkey Vishkey · Accepted Answer · 2013-10-03T12:00:03

First, order the array X for the longest string to shorter. This way, the first string in X that be a substring of all Y strings is the solution.

A multiprocessor algorithm would be the best way to solve the problem of test each X string with all Y strings quickly.

Longest common substring in two sequences of strings

3 Answers