13
votes

This question is slightly different from the usual problem of finding the longest common subsequence or substring of two strings.

Given two strings of the same size N, find the longest substring of each string such that the two substrings contain the same bag of chars.

The two substrings need not have the chars in the same order, but they must have the same bag of chars.

For example,

a = ABCDDEGF
b = FPCDBDAX

The longest matching bag of chars is ABCDD (ABCDD from a, CDBDA from b).

How to solve this problem?


UPDATE

The goal is to find a substring of each input string such that the two substrings have the same bag of chars. A "substring" here means consecutive chars.


Update: Initially I thought of a dynamic programming approach. It works as follows.

Comparing two bags of chars of the same length K takes O(K) time. Convert each string into a shortened form:

ABCDDEGF -> A1B1C1D2E1F1G1
FPCDBDAX -> A1B1C1D2F1P1X1

The shortened form lists the chars in sorted order, each followed by its frequency in the string. Constructing, sorting, and comparing the shortened forms takes O(K) time in total. (This can be implemented with an array of counts indexed by char.)

Two bags of chars are equal iff their shortened forms have the same chars and the same respective frequencies.
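
As a rough illustration of the shortened form, here is a minimal Python sketch. The names `shorten` and `same_bag` are made up for this example, and it assumes the input contains no digits (otherwise the concatenated form becomes ambiguous, and comparing the `Counter` objects directly would be safer).

```python
from collections import Counter

def shorten(s):
    """Return each distinct char in sorted order, followed by its count."""
    counts = Counter(s)
    return "".join(f"{ch}{counts[ch]}" for ch in sorted(counts))

def same_bag(s, t):
    """Two strings hold the same bag of chars iff their shortened forms match."""
    return shorten(s) == shorten(t)

print(shorten("ABCDDEGF"))         # A1B1C1D2E1F1G1
print(shorten("FPCDBDAX"))         # A1B1C1D2F1P1X1
print(same_bag("ABCDD", "CDBDA"))  # True
```

Sorting the distinct chars costs O(A log A) for an alphabet of size A; with a fixed-size count array, as the question suggests, that sorting step is unnecessary.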

In addition, it takes O(logK) time to find the chars that differ between the two strings.

Now, for two input strings:

  1. If their shortened forms are identical, then the whole strings are the longest common bag of chars.
  2. Find the chars in string1 that do not appear in string2. Tokenize string1 into substrings, splitting on those chars.
  3. Find the chars in string2 that do not appear in string1. Tokenize string2 into substrings, splitting on those chars.
  4. Now we have two lists of strings. Compare each pair (which is the same problem with a smaller input) and find the longest common bag of chars.
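
Steps 2 and 3 can be sketched roughly as follows. This is only an illustration: `tokenize` is a hypothetical helper name, and it covers the splitting step only, not the recursive comparison in step 4.

```python
def tokenize(s, other):
    """Split s into maximal runs of chars that also occur in other."""
    allowed = set(other)
    tokens, current = [], []
    for ch in s:
        if ch in allowed:
            current.append(ch)
        elif current:
            # a char missing from `other` ends the current run
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("ABCDDEGF", "FPCDBDAX"))  # ['ABCDD', 'F']
print(tokenize("FPCDBDAX", "ABCDDEGF"))  # ['F', 'CDBDA']
```

Note that when both strings use exactly the same chars, e.g. `tokenize("abbbbbb", "aaaaaaaaab")`, nothing is split off and the input does not get any smaller.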

The worst case would be O(N^3), and the best case would be O(N). Any better ideas?

4
It looks like you're using counting sort for the "shortened form" - you can only use it if you know the range of your characters. Next, you aren't really using the count, only as a way of checking what characters are present. As for point 4 - it isn't a smaller problem input. Given abbbbbb and aaaaaaaaab, you cannot eliminate any letter. Further, the count of the characters gives you very little information, especially when you don't know K from the start. - Kobi
@Kobi: Chars are ranged integers. For example, ASCII would be in the range 0 - 127. It will be more tricky to allow Unicode chars. We need the frequency count to test the "equality" of the two shortened forms. - SiLent SoNG
So, are you checking for every sub-string in all lengths? - Kobi
I'm sorry - isn't it ab, or am I not understanding the question? - Kobi
@Kobi: Initially the two input strings have the same length. It is possible that after tokenization we will face two strings of different lengths... you are right. I am thinking about a solution to your question. - SiLent SoNG

4 Answers

5
votes

Create a set of the characters present in a, and another of the characters present in b. Walk through each string and strike (e.g., overwrite with some otherwise impossible value) all the characters not in the set from the other string. Find the longest string remaining in each (i.e., longest string of only "unstruck" characters).

Edit: Here's a solution that works roughly as noted above, but in a rather language-specific fashion (using C++ locales/facets):

#include <algorithm>
#include <string>
#include <vector>
#include <iostream>
#include <locale>
#include <sstream>
#include <memory>

struct filter : std::ctype<char> {
    filter(std::string const &a) : std::ctype<char>(table, false) {
        std::fill_n(table, std::ctype<char>::table_size, std::ctype_base::space);

        for (size_t i=0; i<a.size(); i++) 
            table[(unsigned char)a[i]] = std::ctype_base::upper;
    }
private:
    std::ctype_base::mask table[std::ctype<char>::table_size];
};

std::string get_longest(std::string const &input, std::string const &f) { 
    std::istringstream in(input);
    filter *filt = new filter(f);

    in.imbue(std::locale(std::locale(), filt));

    std::string temp, longest;

    while (in >> temp)
        if (temp.size() > longest.size())
            longest = temp;
    // no delete here: the locale owns the facet (refs == 0) and destroys it itself
    return longest;
}

int main() { 
    std::string a = "ABCDDEGF",  b = "FPCDBDAX";
    std::cout << "A longest: " << get_longest(a, b) << "\n";
    std::cout << "B longest: " << get_longest(b, a) << "\n";
    return 0;
}

Edit2: I believe this implementation is O(N) in all cases (one traversal of each string). That's based on std::ctype<char> using a table for lookups, which is O(1). With a hash table, lookups would also have O(1) expected complexity, but O(N) worst case, so overall complexity would be O(N) expected, but O(N^2) worst case. With a set based on a balanced tree, you'd get O(N lg N) overall.

3
votes

Just a note to say that this problem will not admit a "greedy" solution in which successively larger bags are constructed by extending existing feasible bags one element at a time. The reason is that even if a length-k feasible bag exists, there need not be any feasible bag of length (k-1), as the following counterexample shows:

ABCD
CDAB

Clearly there is a length-4 bag (A:1, B:1, C:1, D:1) shared by the two strings, but there is no shared length-3 bag. This suggests to me that the problem may be quite hard.
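
The counterexample can be checked with a small brute-force sketch. The function name `shared_bag_lengths` is made up for this illustration; it simply compares sorted substrings of every length and is not meant to be efficient.

```python
def shared_bag_lengths(a, b):
    """Return every length k for which a and b have length-k substrings with equal bags."""
    lengths = []
    for k in range(1, min(len(a), len(b)) + 1):
        # the sorted substring serves as a canonical form of its bag
        bags_a = {"".join(sorted(a[i:i + k])) for i in range(len(a) - k + 1)}
        if any("".join(sorted(b[j:j + k])) in bags_a
               for j in range(len(b) - k + 1)):
            lengths.append(k)
    return lengths

print(shared_bag_lengths("ABCD", "CDAB"))  # [1, 2, 4]
```

Length 3 is missing from the result even though length 4 is present, which is exactly the gap described above.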

1
votes

Let's look at the problem this way. This solution is more optimized and very easy to code, but read through the description and then the code to get the idea; otherwise it will just sound crazy and complex.

THINK ABOUT THIS

Take the two example strings from your question as two sets of characters, i.e. {x,y,z}.

The resulting substring (set) consists of characters common to both strings (sets), its entries are contiguous, and the qualifying substring (set) is the one with the highest number of entries.

These are a few properties of the result, but they only work when used with the following algorithm/methodology.

We have two sets

a = { BAHYJIKLO }

b = { YTSHYJLOP }

Take

a U b = { - , - , H , Y , J , - , - , L , O }

b U a = {Y , - , - , H , Y , J , L , O , -}

I have simply replaced the characters that don't occur in the other set with a "-" (or any special/ignored character).

Doing so, we have two strings from which we can easily extract HYJ, LO, Y, HYJLO.

String/substring comparisons and other processing take time, so what I do is write these substrings to a text file, separated by spaces or on different lines. That way, when I read the file I get each whole string, instead of needing a nested loop to locate a substring or manage temporary variables.

Once you have HYJ, LO, Y, HYJLO, I don't think it's a problem to find your desired result.

NOTE: if you process the strings and substrings with temporary variables and nested loops, first building a substring and then searching for it, it's going to be a very costly solution. You have to use a file like this:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Write s to the file, replacing every character that does not occur in other by a space.
void write_masked(std::ofstream &out, const std::string &s, const std::string &other)
{
    for (char c : s)
        out << (other.find(c) != std::string::npos ? c : ' ');
    out << ' ';   // keep the two masked strings apart
}

int main()
{
    std::string a, b;   // a & b are the two strings
    std::cin >> a >> b;

    {
        std::ofstream out("file1.txt");   // temporary text file
        write_masked(out, a, b);          // a U b
        write_masked(out, b, a);          // b U a
    }
    /* output in the file will be like this
    _____FILE1.txt_____
      HYJ  LO Y  HYJLO
    */

    // load all words from the file, e.g. {"HYJ", "LO", "Y", "HYJLO"}
    std::vector<std::string> words;
    std::ifstream in("file1.txt");
    for (std::string w; in >> w; )
        words.push_back(w);

    // pick the longest word that is a substring of another word
    std::size_t size = 0, index = 0;
    for (std::size_t x = 0; x < words.size(); x++)
        for (std::size_t y = 0; y < words.size(); y++)
            if (x != y && words[y].find(words[x]) != std::string::npos
                && words[x].length() > size)
            {
                size = words[x].length();
                index = x;
            }

    if (size > 0)
        std::cout << words[index] << "\n";
    // it's the desired result.. it's pretty old school but I think you get the idea
}

I wrote the code for this and it's working; if you want it, give me your email and I will send it to you. By the way, I like this problem, and the complexity of this algorithm is about 3n^2.

0
votes

Here's my rather anti-pythonic implementation that nevertheless leverages Python's wonderful built-in sets and strings.

a = 'ABCDDEGF'
b = 'FPCDBDAX'

best_solution = None
best_solution_total_length = 0
seen = set()  # (a_loc, b_loc) pairs already expanded, so no pair is explored twice

def try_expand(a, b, a_loc, b_loc):
    global best_solution_total_length, best_solution
    # out of range checks (slice ends are exclusive, so an end equal to len() is still valid)
    if a_loc[0] < 0 or b_loc[0] < 0:
        return
    if a_loc[1] > len(a) or b_loc[1] > len(b):
        return
    if (a_loc, b_loc) in seen:
        return
    seen.add((a_loc, b_loc))

    sub_a = a[a_loc[0] : a_loc[1]]
    sub_b = b[b_loc[0] : b_loc[1]]
    # sorted() compares the two substrings as bags; set() would ignore repeated chars
    if sorted(sub_a) == sorted(sub_b):
        # is this solution better than anything before it?
        if len(sub_a) + len(sub_b) > best_solution_total_length:
            best_solution = (a_loc, b_loc)
            best_solution_total_length = len(sub_a) + len(sub_b)

    try_expand(a, b, (a_loc[0]-1, a_loc[1]), b_loc)
    try_expand(a, b, (a_loc[0], a_loc[1]+1), b_loc)
    try_expand(a, b, a_loc, (b_loc[0]-1, b_loc[1]))
    try_expand(a, b, a_loc, (b_loc[0], b_loc[1]+1))


for a_i in range(len(a)):
    for b_i in range(len(b)):
        # start the recursive expansion from identical letters in the two strings
        if a[a_i] == b[b_i]:
            try_expand(a, b, (a_i, a_i), (b_i, b_i))

if best_solution is not None:
    print(a[best_solution[0][0] : best_solution[0][1]], b[best_solution[1][0] : best_solution[1][1]])

Forgot to mention that this is obviously a fairly brute-force approach and I'm sure there's an algorithm that runs much, much faster.
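
One possible faster direction, offered only as a sketch and not as anything proposed in this thread, is a sliding window: for each length k, from longest to shortest, keep the char counts of a length-k window up to date incrementally and look them up in a hash table. The names `window_bags` and `longest_common_bag` are made up for this example.

```python
from collections import Counter

def window_bags(s, k):
    """Yield (start, bag) for every length-k window of s, updating counts incrementally."""
    counts = Counter(s[:k])
    yield 0, frozenset(counts.items())
    for i in range(1, len(s) - k + 1):
        out_ch, in_ch = s[i - 1], s[i + k - 1]
        counts[out_ch] -= 1
        if counts[out_ch] == 0:
            del counts[out_ch]  # drop zero entries so equal bags compare equal
        counts[in_ch] += 1
        yield i, frozenset(counts.items())

def longest_common_bag(a, b):
    """Return (substring of a, substring of b) with equal bags and maximal length, or None."""
    for k in range(min(len(a), len(b)), 0, -1):
        first_seen = {}
        for i, bag in window_bags(a, k):
            first_seen.setdefault(bag, i)
        for j, bag in window_bags(b, k):
            if bag in first_seen:
                i = first_seen[bag]
                return a[i:i + k], b[j:j + k]
    return None

print(longest_common_bag("ABCDDEGF", "FPCDBDAX"))  # ('ABCDD', 'CDBDA')
```

There are O(N) window lengths and O(N) windows per length, and freezing each bag costs time proportional to the number of distinct chars in the window, so under these assumptions the sketch does roughly O(N^2) window operations, each costing up to the alphabet size.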