*This is a brief introduction, the specific question is in bold at the last paragraph.
I'm trying to generate all strings with a given Hamming Distance to solve efficiently a bioinformatic assignment.
The idea is, given a string (ie. 'ACGTTGCATGTCGCATGATGCATGAGAGCT'), the length of the word to search (ie. 4) and the acceptable mismatches when searching that word in the string (ie. 1), return the most frequent words or 'mutated' words.
To be clear, a word of length 4 from the given string can be this (between '[ ]'):
[ACGT]TGCATGTCGCATGATGCATGAGAGCT #ACGT
this
A[CGTT]GCATGTCGCATGATGCATGAGAGCT #CGTT
or this
ACGTTGCATGTCGCATGATGCATGAG[AGCT] #AGCT
What I did was (and its very inefficiently, and its really slow when the words need to have 10 characters) generate all possible words with the given distance:
itertools.imap(''.join, itertools.product('ATCG', repeat=wordSize))
and then search and compare every word in the given string if the generated words (or its mutation) appears in a loop:
wordFromString = givenString[i:i+wordSize]
mismatches = sum(ch1 != ch2 for ch1, ch2 in zip(wordFromString, generatedWord))
if mismatches <= d:
#count that generated word in a list for future use
#(only need the most repeated)
What I want to do is, instead of generating ALL possible words, generate just the mutations of the words that appear in the given string with a given number of mismatches, in other words, given a Hamming Distance and a word, return all the possible mutated words with that (or less) distance, and then use them for searching in the given string.
I hope I was clear. Thank you.