22
votes

*This is a brief introduction, the specific question is in bold at the last paragraph.

I'm trying to generate all strings with a given Hamming Distance to solve efficiently a bioinformatic assignment.

The idea is, given a string (ie. 'ACGTTGCATGTCGCATGATGCATGAGAGCT'), the length of the word to search (ie. 4) and the acceptable mismatches when searching that word in the string (ie. 1), return the most frequent words or 'mutated' words.

To be clear, a word of length 4 from the given string can be this (between '[ ]'):

[ACGT]TGCATGTCGCATGATGCATGAGAGCT #ACGT

this

A[CGTT]GCATGTCGCATGATGCATGAGAGCT #CGTT

or this

ACGTTGCATGTCGCATGATGCATGAG[AGCT] #AGCT

What I did was (and its very inefficiently, and its really slow when the words need to have 10 characters) generate all possible words with the given distance:

itertools.imap(''.join, itertools.product('ATCG', repeat=wordSize))

and then search and compare every word in the given string if the generated words (or its mutation) appears in a loop:

wordFromString = givenString[i:i+wordSize]
mismatches = sum(ch1 != ch2 for ch1, ch2 in zip(wordFromString, generatedWord))
if mismatches <= d:
    #count that generated word in a list for future use
    #(only need the most repeated)

What I want to do is, instead of generating ALL possible words, generate just the mutations of the words that appear in the given string with a given number of mismatches, in other words, given a Hamming Distance and a word, return all the possible mutated words with that (or less) distance, and then use them for searching in the given string.

I hope I was clear. Thank you.

6
would you count overlaps? For example, for the string, "CCACCAC", and the word, "CCAC", would you count two appearances or just one? since the "C" overlaps.גלעד ברקן
Agnostic answer with explanation stackoverflow.com/a/34523345/1090562Salvador Dali

6 Answers

18
votes
def mutations(word, hamming_distance, charset='ATCG'):
    for indices in itertools.combinations(range(len(word)), hamming_distance):
        for replacements in itertools.product(charset, repeat=hamming_distance):
            mutation = list(word)
            for index, replacement in zip(indices, replacements):
                mutation[index] = replacement
            yield "".join(mutation)

This function generates all mutations of a word with a hamming distance less than or equal to a given number. It is relatively efficient, and does not check invalid words. However, valid mutations can appear more than once. Use a set if you want every element to be unique.

5
votes

Let the given Hamming distance be D and let w be the "word" substring. From w, you can generate all words with distance ≤D by a depth-limited depth-first search:

def dfs(w, D):
    seen = set([w])      # keep track of what we already generated
    stack = [(w, 0)]

    while stack:
        x, d = stack.pop()
        yield x

        if d == D:
            continue

        # generate all mutations at Hamming distance 1
        for i in xrange(len(x)):
            for c in set("ACGT") - set(x[i])
                 y = x[:i] + c + x[i+1:]
                 if y not in seen:
                     seen.add(y)
                     stack.append((y, d + 1))

(This will by no means be fast, but it may serve as inspiration.)

5
votes

If I understand your problem correctly, you want to identify the highest score k-mers in a genome G. A k-mer's score is the number of times it appears in the genome plus the number of times any k-mer with Hamming distance less than m also appears in the genome. Note that this assumes you are only interested in k-mers that appear in your genome (as pointed out by @j_random_hacker).

You can solve this in four basic steps:

  1. Identify all k-mers in the genome G.
  2. Count the number of times each k-mer appears in G.
  3. For each pair (K1, K2) of k-mers, increment the count for both K1 and K2 if their Hamming distance is less than m.
  4. Find the max k-mer and its count.

Here's example Python code:

from itertools import combinations
from collections import Counter

# Hamming distance function
def hamming_distance(s, t):
    if len(s) != len(t):
        raise ValueError("Hamming distance is undefined for sequences of different lengths.")
    return sum( s[i] != t[i] for i in range(len(s)) )

# Main function
# - k is for k-mer
# - m is max hamming distance
def hamming_kmer(genome, k, m):
    # Enumerate all k-mers
    kmers = [ genome[i:i+k] for i in range(len(genome)-k + 1) ]

    # Compute initial counts
    initcount  = Counter(kmers)
    kmer2count = dict(initcount)

    # Compare all pairs of k-mers
    for kmer1, kmer2 in combinations(set(kmers), 2):
        # Check if the hamming distance is small enough
        if hamming_distance(kmer1, kmer2) <= m:
            # Increase the count by the number of times the other
            # k-mer appeared originally
            kmer2count[kmer1] += initcount[kmer2]
            kmer2count[kmer2] += initcount[kmer1]

    return kmer2count


# Count the occurrences of each mismatched k-mer
genome = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
kmer2count = hamming_kmer(genome, 4, 1)

# Print the max k-mer and its count
print max(kmer2count.items(), key=lambda (k,v ): v )
# Output => ('ATGC', 5)
2
votes

Here's what I think the problem that you're trying to solve is: You have a "genome" of length n, and you want to find the k-mer that approximately appears most frequently in this genome, where "approximately appears" means appears with Hamming distance <= d. This k-mer need not actually appear exactly anywhere in the genome (e.g. for genome ACCA, k=3 and d=1, the best k-mer is CCC, appearing twice).

If you generate all k-mers of Hamming distance <= d from some k-mer in the string and then search for each one in the string, as you seem to be currently doing, then you're adding an unnecessary O(n) factor to the search time (unless you search for them all simultaneously using the Aho-Corasick algorithm, but that's overkill here).

You can do better by going through the genome, and at each position i, generating the set of all k-mers that are at distance <= d from the k-mer starting at position i in the genome, and incrementing a counter for each one.

1
votes
def generate_permutations_close_to(initial = "GACT",charset="GACT"):
    for i,c in enumerate(initial):
         for x in charset:
             yield initial[:i] + x + inital[i+1:]

will generate permutations with a dist of 1 (it will also repeat contain repeats)

to get a set of all within 2 ... then call this with each of the first solutions as initial guesses

1
votes

There are correct answers here, which heavily utilize python with it's magical functions that do almost everything for you. I will try to explain things with math and algorithms, so that you can apply it to any language you want.


So you have an alphabet {a1, a2, ... a_a} (the cardinality of a)in your case {'A', 'C', 'G', 'T'} and the cardinality is 4. You have a string of length k and you want to generate all the strings whose hamming distance is less or equal to d.

First of all how many of them do you have? The answer does not depend on the string you select. If you selected a sting, you will have C(d, k)(a-1)^d strings of which have a hamming distance d from your string. So total number of strings is:

enter image description here

It rises exponentially in terms of almost every parameter, so you will not have any sort of fast algorithm to find all the words.


So how would you derive an algorithm that will generate all the strings? Notice that it is easy to generate a string which is at most one hamming distance away from your wold. You just need to iterate over all characters in the string and for each character try each letter in the alphabet. As you will see, some of the words would be the same.

Now to generate all the strings that are two hamming distances away from your string you can apply the same function that generate one hamming distance words for each word in the previous iteration.

So here is a pseudocode:

function generateAllHamming(str string, distance int): 
    alphabet = ['A', ...]// array of letters in your alphabet
    results = {} // empty set that will store all the results
    prev_strings = {str} // set that have strings from the previous iterations
    // sets are used to get rid of duplicates

    if distance > len(str){ distance = len(str)} // you will not get any new strings if the distance is bigger than length of the string. It will be the same all possible strings.

    for d = 0; d < distance; d++ {
        for one_string in prev_strings {
            for pos = 0; pos < len(one_string); pos++ {
                for char in alphabet {
                    new_str = substitute_char_at_pos(one_string, char, pos)
                    add new_str to set results 
                }
            }
        }

        populate prev_strings with elements from results
    }

    return your results
}