Using binary search and hashing for pattern matching with k-most mismatches

Question

I am trying to solve a program where I use hashing (via a rolling hash structure) and a binary search to find all positions where a pattern can be found in some text. The role of binary search in this can be better explained by Mr. Sergei Tsaturian on a coursera discussion forum.

For example, you want to find 2 mismatches in strings aabaabba and aaaaaaaa. First, you look for the first mismatch. It's the first position i such that strings are the same on [0,i-1] but different in i-th position (in our case it should be 2). So, we can use binary search to find it, checking at each step whether the strings [0,i-1] are the same. If yes, move to the right half, if not, move to the left half. After you found the first mismatch, consider the parts of the strings after the first mismatch (aabba and aaaaa). Now you want to find the first mismatch in there (this should output position 5 in the original string), and you can use another binary search for that. So, you will need up to k binary searches in total to find k mismatches.

My rolling-hash and other functions are working alright since I have tested them for other programs - it is the binary search where I am faltering.

My binary search program uses 4 pointers - start (the regular low pointer), end (the high pointer), mid and id. The only variation from a standard binary search is id and id marks the start of the current portion of the substring that I am considering so I can obtain the hashes easily. When start is greater than end, the index of the mismatch found is start - 1 and therefore the id pointer becomes start to negate the mismatch and the string before it and the conditional checks if the mismatch is at the end of the substring but this approach breaks down when considering a match of 'cab' and 'ccc' with at most 1 mismatch. Here it returns True even though it is very clearly False.

My function:

def find_num_matches(p1, p2, t1, t2, m1, m2, x, k, len_p, i):
    """
    Uses binary search to find the number of mismatches. It finds the left-most
    mismatch and then looks at the substring hash after that index position and
    continues until the number of mismatches are more than k or the start pointer
    is more than or equal to end.
    """
    start = 0
    for mismatch in range(k):
        # print(f'start: {start}, id: {id}')
        id = start
        end = len_p-1
        while start <= end:
            mid = start + (end-start)//2
            p_h1, p_h2 = get_hash_value(
                p1, m1, x, id, mid-start), get_hash_value(p2, m2, x, id, mid-start)
            s_h1, s_h2 = get_hash_value(
                t1, m1, x, i+id, mid-start), get_hash_value(t2, m2, x, i+id, mid-start)
            if p_h1 == s_h1 and p_h2 == s_h2:  # move to the right half, no mismatch yet
                start = mid + 1
            else:
                end = mid - 1
        # when loop exits, start - 1 is the index of the mismatch.
        if start == len_p:  # found the last mismatch
            return True
    return False

The function get hash-value simple returns the hash-value of a substring of the text or the pattern to be matched and the addition of i for the substring simply indicates where the current substring starts on the whole text.

The variable i is passed from the function that calls binary search. In the text 'abcdfe' where you match pattern 'abd', you try all 3 letter patterns so you run a loop in range(0 to len(text) - len(pattern) + 1) and i is the counter in this loop. It indicates where your current substring starts as part of the whole text.

Please let me know if you have any suggestions on how I can correct my approach and rectify my error - thank you!

EDIT:

An edge-case where this breaks is comparing 'caa' with 'ccc' and k = 1. On the first iteration; start = 0, end = 2, mid = 1, id = 0 and so the hashes from index id for length mid - start i.e 'c' and 'c' are compared. Since these match, start becomes mid + 1. Second iteration; start = 2, end = 2, mid = 2, id = 0 and so the hashes from index id for length mid - start which is length 0 are compared and so the hashes (or lack thereof) match and start increments to mid + 1 which is 3. Now, since start > end, it exits the loops and assumes the char at index start - 1 is the left-most mismatch.

mirzanahal mirzanahal · Accepted Answer · 2020-09-03T06:21:06

It seems you never check your mids. I think you should change your start and end updates:

if p_h1 == s_h1 and p_h2 == s_h2:  # move to the right half, no mismatch yet
   start = mid
else:
   end = mid

Using binary search and hashing for pattern matching with k-most mismatches

1 Answers