Longest Common Subsequence(LCS) intuition

votes

The recursive version of LCS code looks something like this(m, n are lengths of the strings X and Y respectively)

    int lcs( char[] X, char[] Y, int m, int n ) { 
      if (m == 0 || n == 0) 
        return 0; 
      if (X[m-1] == Y[n-1]) 
        return 1 + lcs(X, Y, m-1, n-1); 
      else
        return max(lcs(X, Y, m, n-1), lcs(X, Y, m-1, n)); 
    }

I notice that this algo deletes characters of the strings from the end and creates various substrings of the original two strings and then tries to look for a match.

What I don't understand is, since it considers only certain substrings and not all possible subsequences, how is this algorithm guaranteed to provide the correct result? Is there a logical/mathematical/intuitive proof behind this?

stringalgorithmrecursiondynamic-programminglcs

2 Answers

votes

one of the step in Dynamic programming is guessing (check out MIT lecture on YouTube)

You have two pointers to each character in both the strings respectively.

The guess here is to either include the character or not.

if both the character are same then you include the character and move the pointer ahead.
chose a character either from String X or Y. hence the max(lcs(X, Y, m, n-1), lcs(X, Y, m-1, n))

it's recurrence is given as:

Since you're guessing all the possible outcomes the above recurrence gives correct answer.

for more reference: https://courses.csail.mit.edu/6.006/fall10/handouts/recitation11-19.pdf

votes

It calculates either lcs( X, Y, m-1, n-1) when the last chars are the same.

It calculates both lcs( X, Y, m-1,n ) AND lcs( X, Y, m, n-1) when the characters are different.

So it covers all possible solutions.

If the lengths are 0. Then lcs is 0. If the last (or first) items are the same, then the lcs is 1 + lcs(X,Y,m-1,n-1) This clause reduces the problem by adding the common element to the lcs, and finding the smaller lcs of the 2 shorter strings.

Final the longest subsequence is either

lcs(X, Y, m-1, n)

lcs(X, Y, m, n-1)

So when it is asked to calculate lcs( "abce", "bghci", 4, 5);

if( m(4) == 0 || n(5) == 0 ) // evaluates to false
if( X[3]('e')  == Y[4] ) // evaluates to false
return max( lcs( "abc", "bghci", 3, 5 ),   // These two recursive calls
            lcs( "abce", "bghc", 4, 4 ) ); // cover the whole problem space

e.g.

(A) lcs( "abce", "bghci" ) =>  (B) lcs( "abc", "bghci" ) 
                               (C) lcs( "abce", "bghc" )
(B) lcs( "abc",  "bghci" ) =>  (D) lcs( "ab",   "bghci" ) 
                               (E) lcs( "abc",  "bghc" )
(C) lcs( "abce", "bghc" )  =>  (F) lcs( "abc",  "bghc" )
                               (G) lcs( "abce", "bgh"  )
(D) lcs( "ab"  , "bghci")  =>  (H) lcs( "a" ,   "bghci")
                               (I) lcs( "ab"  , "bghc" )
(E) lcs( "abc", "bghc" )   =>  1 + (J) lcs( "ab"  , "bgh")
(F) lcs( "abc", "bghc" )   =>  1 + (K) lcs( "ab"  , "bgh")
(G) lcs( "abce", "bgh" )   =>  (L) lcs( "abc", "bgh" )
                               (M) lcs( "abce", "bg" )
(H) lcs( "a", "bghci")     =>      lcs( "",    "bghci" )  (0)
                               (N) lcs( "a",   "bghc" )
(I) lcs( "ab", "bghc")     =>  (O) lcs( "a",   "bghc" )
                               (P) lcs( "ab",  "bgh" )
(J) lcs( "ab", "bgh" )     =>  (Q) lcs( "a",   "bgh" )
                               (R) lcs( "ab",  "bg" )
(K) lcs( "ab", "bgh" )     =>  (S) lcs( "a",   "bgh" )
                               (T) lcs( "ab",  "bg" )
(L) lcs( "abc", "bgh" )    =>  (U) lcs( "ab",  "bg" )
                               (V) lcs( "abc", "bg" )
(M) lcs( "abce", "bg" )    =>  (W) lcs( "abc", "bg" )
                               (X) lcs( "abce","b" )
(N) lcs( "a",  "bghc")     =>      lcs( "",    "bghc" )   (0)
                               (Y) lcs( "a",   "bgh" )
(O) lcs( "a", "bghc" )     =>      lcs( "",    "bghc" )   (0)
                                   lcs( "a"    "bgh" )
(P) lcs( "ab", "bgh" )     =>  (Z) lcs( "a",   "bgh" )
                              (AA) lcs( "ab",  "bg" )
(Q) lcs( "a",  "bgh" )     =>      lcs( "",    "bgh")     (0)
                              (AB) lcs( "a",   "bg")
(R) lcs( "ab", "bg" )      => (AC) lcs( "a",   "bg")
                              (AD) lcs( "ab",  "b" )
(S) lcs( "a", "bgh" )      =>      lcs( "",    "bgh")     (0)
                              (AE) lcs( "a",   "bg" )
(T) lcs( "ab", "bg" )      => (AF) lcs( "a",   "bg" )
                              (AG) lcs( "ab",  "b" )
(U) lcs( "ab", "bg" )      => (AH) lcs( "a",   "bg" )
                              (AI) lcs( "ab",  "b" )
(V) lcs( "abc","bg" )      => (AJ) lcs( "ab",  "bg" )
                           => (AK) lcs( "abc", "b" )
(W) lcs( "abc","bg" )      => (AL) lcs( "ab",  "bg" )
                              (AM) lcs( "abc", "b"  )
(X) lcs( "abce", "b")      => (AN) lcs( "abc", "b" )
                                   lcs( "abce", "" )     (0)
(Y) lcs( "abce", "b")      => (AO) lcs( "abc", "b" )
                                   lcs( "abce", "" )     (0)
(Z) lcs( "a", "bgh")       =>      lcs( "",    "bgh" )   (0)
                              (AP) lcs( "a",   "bg" )
(AA)lcs( "ab", "bg")       => (AQ) lcs( "a",   "bg" )
                              (AR) lcs( "ab",  "b" )
(AB)lcs( "a",  "bg")       =>      lcs( "",    "bg" )    (0)
                              (AS) lcs( "a",   "b" )
(AC)lcs( "a",  "bg")       =>      lcs( "",    "bg")     (0)
                              (AT) lcs( "a",   "b")
(AD)lcs( "ab", "b")        =>  1 + lcs( "a",   "" )      (1)
(AE)lcs( "a",  "bg")       =>      lcs( "",    "bg")
                              (AU) lcs( "a",   "b" )
(AF)lcs( "a",  "bg")       =>      lcs( "",    "bg")     (0)
                              (AV) lcs( "a",   "b")
(AG)lcs( "ab", "b" )       =>  1 + lcs( "a",   "" )      (1)
(AH)lcs( "a",  "bg")       =>      lcs( "",    "bg")     (0)
                              (AW) lcs( "a",   "b" )
(AI)lcs( "ab", "b")        =>  1 + lcs( "a",   "" )      (1)
(AJ)lcs( "ab", "bg")       => (AX) lcs( "a",   "bg")
(AK)lcs( "abc", "b")       => (AY) lcs( "ab",  "b")
                                   lcs( "abc", "b")      (0)
(AL)lcs( "ab", "bg")       => (AZ) lcs( "a",   "bg")
                              (BA) lcs( "ab",  "b")
(AM)lcs( "abc", "b")       => (BB) lcs( "ab",  "b")
                                   lcs( "abc", "" )      (0)
(AN)lcs( "abc", "b")       => (BC) lcs( "ab",  "b")
                                   lcs( "abc", "" )      (0)
(AO)lcs( "abc", "b")       => (BD) lcs( "ab",  "b")
                                   lcs( "abc", "" )      (0)
(AP)lcs( "a", "bg")        =>      lcs( "",    "bg")     (0)
                              (BE) lcs( "a",   "b") 
(AQ)lcs( "a", "bg")        =>      lcs( "",    "bg")     (0)
                              (BF) lcs( "a",   "b") 
(AR)lcs( "ab", "b")        => 1 +  lcs( "a", "" )        (1)
(AS)lcs( "a",  "b")        =>      lcs( "",  "b")        (0)
                                   lcs( "a", "" )        (0)
(AT)lcs( "a", "b" )        as (AS)
(AU)lcs( "a", "b" )        as (AS)
(AV)lcs( "a", "b")         as (AS)
(AW)lcs( "a", "b")         as (AS)
(AX)lcs( "a", "bg")        =>      lcs( "", "bg")        (0)
                              (BG) lcs( "a", "b")
(AY)lcs( "ab", "b")        => 1 +  lcs( "a", "" )        (1)
(AZ)lcs( "a",  "bg")       =>      lcs( "", "bg")        (0)
                              (BH) lcs( "a", "b")
(BA)lcs( "ab", "b")        => 1 +  lcs( "a", "" )        (1)
(BB)lcs( "ab", "b")        => 1 +  lcs( "a", "" )        (1)
(BC)lcs( "ab", "b")        => 1 +  lcs( "a", "" )        (1)
(BD)lcs( "ab", "b")        => 1 +  lcs( "a", "" )        (1)
(BE)lcs( "a",  "b")        as (AS)
(BF)lcs( "a",  "b")        as (AS)
(BG)lcs( "a",  "b")        as (AS)

So the lcs of 2 comes from the following call stack....

lcs( "abcde", "bghci")               // (A)
  lcs( "abcd", "bghci" )             // (B)
    lcs( "abc",  "bghc" )            // (E)
      1 + lcs( "ab", "bgh" )         // (J)
        lcs( "ab",  "bg" )           // (R)
          lcs( "ab",  "b" )          // (AD)
            1 + lcs( "a",   "" )

Which gives an answer of 2. As you can see, it tests alot more combinations.

What I don't understand is, since it considers only certain substrings and not all possible subsequences,

The trick in the algorithm is to notice that if the first (or last character is the same), then the longest common subsequence is 1 + the lcs of the 2 shorter strings. Something like proof by induction, or proof by contradiction can help you prove that the division of work is necessary and sufficient to cover all the possible alternatives.

As can be seen in my build out, it considers many alternatives multiple times in calculating the lcs, so it is not a very efficient algorithm.