I have a huge number of records containing sequences ('ATCGTGTGCATCAGTTTCGA...'), up to 500 characters. I also have a list of smaller sequences, usually 10-20 characters. I would like to use the Levenshtein distance in order to find these smaller sequences inside the records allowing for small changes or indels (L_distance <=2).
The problem is that I also want to get the start position of each match, and Levenshtein.distance() compares the two whole strings, so the length difference between a record and a pattern swamps the result instead of locating the pattern inside the record.
>>> import Levenshtein
>>> s1 = raw_input('first word: ')
first word: ATCGTAATACGATCGTACGACATCGCGGCCCTAGC
>>> s2 = raw_input('second word: ')
second word: TACGAT
>>> Levenshtein.distance(s1,s2)
29
In this example I would like to obtain the position (7) and the distance (0 in this case).
Is there an easy way to solve this problem, or do I have to break the larger sequences into smaller ones and then run the Levenshtein distance for all of them? That might take too much time.
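For what it's worth, the third-party `regex` module (a drop-in replacement for `re`) supports approximate matching directly, which would give both the position and the error count in one call; a minimal sketch, assuming `regex` is installed (`pip install regex`):

```python
import regex  # third-party module, not the stdlib 're'

s1 = 'ATCGTAATACGATCGTACGACATCGCGGCCCTAGC'
# {e<=2} allows up to 2 total errors (substitutions + insertions + deletions);
# BESTMATCH returns the best match instead of the first acceptable one.
m = regex.search('(?:TACGAT){e<=2}', s1, flags=regex.BESTMATCH)
if m:
    print(m.start(), sum(m.fuzzy_counts))
```

`m.fuzzy_counts` is a `(substitutions, insertions, deletions)` tuple, so its sum is the edit distance of the match.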
Thanks.
UPDATE: naive implementation that checks every same-length window of the text, run only after an exact-match search fails.
def find_tag(pattern, text, errors):
    m = len(pattern)
    match = None  #returned as None if no window is within 'errors'
    min_distance = errors + 1
    for i in xrange(len(text) - m + 1):
        distance = Levenshtein.distance(text[i:i+m], pattern)
        print text[i:i+m], distance #to see all matches
        if distance < min_distance:  #implies distance <= errors
            match = [i, distance]
            min_distance = distance
    return match
#Real example. In this case just looking for one pattern, but we have about 50.
import re, Levenshtein

text = 'GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACATTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACACAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAGGGGATAGG' #Example of a record
pattern = 'TGCACTGTAGGGATAACAAT' #distance 1 from the record
errors = 2 #max errors allowed

match = re.search(pattern, text) #First we look for an exact match
if match:
    print [match.start(), 0]
else:
    print find_tag(pattern, text, errors)
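If the sliding-window loop turns out to be too slow (it recomputes a full distance matrix at every offset), a semi-global dynamic program (Sellers' algorithm) does the whole scan in a single O(len(text) * len(pattern)) pass: the top DP row is pinned at 0 so a match may start anywhere, and the bottom row gives, for each end position, the best distance of the pattern against any substring ending there. A sketch; `fuzzy_search` is my name for it, not part of the Levenshtein module:

```python
def fuzzy_search(pattern, text, max_errors):
    """Return (distance, end_position) of the best approximate match of
    pattern inside text, or None if no match has distance <= max_errors."""
    m = len(pattern)
    # prev[i] = distance of pattern[:i] vs the best substring ending here
    prev = list(range(m + 1))
    best = None
    for j, ch in enumerate(text):
        curr = [0]  # row 0 stays 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr.append(min(prev[i] + 1,          # deletion in text
                            curr[i - 1] + 1,      # insertion in text
                            prev[i - 1] + cost))  # match / substitution
        prev = curr
        if curr[m] <= max_errors and (best is None or curr[m] < best[0]):
            best = (curr[m], j + 1)  # end position, exclusive
    return best
```

On the record above, `fuzzy_search('TGCACTGTAGGGATAACAAT', text, 2)` reports distance 1 with the match ending at offset 24; the start offset can then be recovered with one short backward alignment from that end position.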