How to find all common longest substrings of a list of strings

Question

I have a list of strings for which I need to find ALL common unique substrings (actually paths) with minimal length in them. Example:

/a/b/c

/a/b

/a

/d/e/f

/d/e

/g/h

For this input, I need the following result:

/a

/d/e

/g/h

As you see, I need the paths (or substrings) with the minimal length that have a unique prefix. /a is the minimal substring for all paths starting with /a. /d/e is the minimal substring for all paths starting with /d/e. The same goes for /g/h.

A practical application of this is to find all roots of the path trees that have a certain file in it to analyze them further. Consider this example:

/a/b/c/index.html

/a/b/index.html

/a/index.html

/d/e/f/index.html

/d/e/index.html

/g/h/index.html

Let's say I want to have the topmost (in terms of the root) paths that contain an index.html file. As a result, I want "/a/index.html", "/d/e/index.html" and "/g/h/index.html".

Any ideas? There is a lot of theory and examples for the "simple" longest common substring problem, but I have not yet found a solution for finding ALL the common longest substrings efficiently.

A solution with pseudo code would be highly appreciated.

Is the longest substring always going to be a member of the group? And, what language are you using? (You can add a tag for that.) And, I don't see why /g/h would be part of the output. — Gordon Linoff
From your example I assume you need something like the smallest set of substrings that covers your entire list. Is this correct? — biziclop
There is no common substring if you consider all the items of your input, are you talking about 3 different inputs and their corresponding outputs? — Aaron
@KarolyHorvath No, I mean that a covers the first three entries (abc, ab and a), de covers de and def and gh covers gh. So there is no element in the source set which hasn't got a substring in the result set. That's my guess :) — biziclop

maraca maraca · Accepted Answer · 2015-11-07T12:04:38

Now with the improved description I think the following algorithm will do:

Split the list of strings into a list of segments (list of array of string)
Start with i = 1 and increase it every iteration doing the following (step 3 and 4) until there are no more items in the list of segments:
Add all segment arrays with length i to the list (if not already in there) of current solutions and the corresponding paths to the final solution.
Remove all items from the list of segments for which the first i items are the same as for one of the items in the current solution (then reset the current solution).

How to find all common longest substrings of a list of strings

1 Answers