0
votes

I have a list of strings for which I need to find ALL common unique substrings (actually paths) with minimal length in them. Example:

/a/b/c

/a/b

/a

/d/e/f

/d/e

/g/h

For this input, I need the following result:

/a

/d/e

/g/h

As you see, I need the paths (or substrings) with the minimal length that have a unique prefix. /a is the minimal substring for all paths starting with /a. /d/e is the minimal substring for all paths starting with /d/e. The same goes for /g/h.

A practical application of this is to find all roots of the path trees that have a certain file in it to analyze them further. Consider this example:

/a/b/c/index.html

/a/b/index.html

/a/index.html

/d/e/f/index.html

/d/e/index.html

/g/h/index.html

Let's say I want to have the topmost (in terms of the root) paths that contain an index.html file. As a result, I want "/a/index.html", "/d/e/index.html" and "/g/h/index.html".

Any ideas? There is a lot of theory and examples for the "simple" longest common substring problem, but I have not yet found a solution for finding ALL the common longest substrings efficiently.

A solution with pseudo code would be highly appreciated.

1
Is the longest substring always going to be a member of the group? And, what language are you using? (You can add a tag for that.) And, I don't see why /g/h would be part of the output.Gordon Linoff
From your example I assume you need something like the smallest set of substrings that covers your entire list. Is this correct?biziclop
There is no common substring if you consider all the items of your input, are you talking about 3 different inputs and their corresponding outputs?Aaron
Dear upvoter, please do tell us, what the task is...Karoly Horvath
@KarolyHorvath No, I mean that a covers the first three entries (abc, ab and a), de covers de and def and gh covers gh. So there is no element in the source set which hasn't got a substring in the result set. That's my guess :)biziclop

1 Answers

0
votes

Now with the improved description I think the following algorithm will do:

  1. Split the list of strings into a list of segments (list of array of string)
  2. Start with i = 1 and increase it every iteration doing the following (step 3 and 4) until there are no more items in the list of segments:
  3. Add all segment arrays with length i to the list (if not already in there) of current solutions and the corresponding paths to the final solution.
  4. Remove all items from the list of segments for which the first i items are the same as for one of the items in the current solution (then reset the current solution).