2
votes

I'm trying to convert a DNA sequence to an amino acid sequence. I have a dictionary of codons:

codon_mapping = {'AAA': 'K','AAC': 'N','AAG': 'K','AAT': 'N','ACA': 'T','ACC': 'T','ACG': 'T','ACT': 'T','AGA': 'R','AGC': 'S','AGG': 'R','AGT': 'S','ATA': 'I','ATC': 'I','ATG': 'M','ATT': 'I','CAA': 'Q','CAC': 'H','CAG': 'Q','CAT': 'H','CCA': 'P','CCC': 'P','CCG': 'P','CCT': 'P','CGA': 'R','CGC': 'R','CGG': 'R','CGT': 'R','CTA': 'L','CTC': 'L','CTG': 'L','CTT': 'L','GAA': 'E','GAC': 'D','GAG': 'E','GAT': 'D','GCA': 'A','GCC': 'A','GCG': 'A','GCT': 'A','GGA': 'G','GGC': 'G','GGG': 'G','GGT': 'G','GTA': 'V','GTC': 'V','GTG': 'V','GTT': 'V','TAA': '*','TAC': 'Y','TAG': '*','TAT': 'Y','TCA': 'S','TCC': 'S','TCG': 'S','TCT': 'S','TGA': '*','TGC': 'C','TGG': 'W','TGT': 'C','TTA': 'L','TTC': 'F','TTG': 'L','TTT': 'F'}

And an input sequence:

seq = 'ATGTATGGCTAGCTTACTACTGCGCACTGATGTGGCTATCGATCGCTGGTCGTTGCTGACCGAGCTAAA'

I currently have this code:

#import re
import re

#find the start codons in the sequence
starts=[m.start() for m in re.finditer('ATG', seq)]

#establish new dictionary
seqDictionary={}
#translate sequences
for i in starts:
    mySeq=seq[i:]
    translated=''
    for n in range(0, len(mySeq), 3):
        print(mySeq[n:n+3])
        if codon_mapping[mySeq[n:n+3]] != '*':
            translated += codon_mapping[mySeq[n:n+3]]
        if codon_mapping[seq[n:n+3]] == '*':
            break 
    print("translated: " + translated)
    seqDictionary[i]=(translated)
print(seqDictionary)
            
AA_frame1 = seqDictionary[0] 
AA_frame2 = seqDictionary[4] 
AA_frame3 = seqDictionary[29]
AA_longest = None 

The problem is that for the second and third sequences (starting at positions 4 and 29, respectively), the for-loop exits after the fourth amino acid, even though the codons that follow are not stop codons.

The output of the above code is:

ATG
TAT
GGC
TAG
translated: MYG
ATG
GCT
AGC
TTA
translated: MASL
ATG
TGG
CTA
TCG
translated: MWLS
{0: 'MYG', 4: 'MASL', 29: 'MWLS'}

I'm not getting any error messages, and I can't figure out why the loop is exiting. I know the correct solutions for the translated sequences are:

MYG
MASLLLRTDVAIDRWSLLTEL
MWLSIAGRC

Edit: this final code worked:

#import re
import re

#find the start codons in the sequence
starts=[m.start() for m in re.finditer('ATG', seq)]

#establish new dictionary
seqDictionary={}
#translate sequences
for i in starts:
    mySeq=seq[i:]
    translated=''
    for n in range(0, len(mySeq), 3):
        if len(mySeq[n:n+3]) < 3:
            break
        if codon_mapping[mySeq[n:n+3]] == '*':
            break
        else:
            translated += codon_mapping[mySeq[n:n+3]]
    seqDictionary[i]=(translated)
print(seqDictionary)

Output:

{0: 'MYG', 4: 'MASLLLRTDVAIDRWSLLTEL', 29: 'MWLSIAGRC'}
Is it because of a typo? In if codon_mapping[seq[n:n+3]] == '*':, should seq be mySeq? - adrtam
I also suspect some confusion between seq and mySeq. Why not if codon_mapping[mySeq[n:n+3]] != '*': #...; else: break ? - bli

2 Answers

2
votes
if codon_mapping[mySeq[n:n+3]] != '*':
    translated += codon_mapping[mySeq[n:n+3]]
if codon_mapping[seq[n:n+3]] == '*':
    break 

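Here the two conditions do not check the same thing: the first if looks at mySeq, but the second looks at seq. Since seq[9:12] is 'TAG' (the stop codon of the first frame), the second if breaks at n == 9 for every start position, which is why the later frames stop after four amino acids.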

This is better written as an if/else than as two separate ifs:

if codon_mapping[mySeq[n:n+3]] == '*':
    break
else:
    translated += codon_mapping[mySeq[n:n+3]]
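
The fix can be checked against the data from the question. This is a minimal sketch, not the asker's exact code: the helper name translate_from is hypothetical, and the codon table is rebuilt compactly on the assumption that it matches the standard table in the question's codon_mapping. The length check guards against the incomplete codon at the end of a frame.

```python
from itertools import product

# Assumption: the standard genetic code, rebuilt compactly; it is meant to
# match the codon_mapping dictionary given in the question.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_mapping = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

seq = 'ATGTATGGCTAGCTTACTACTGCGCACTGATGTGGCTATCGATCGCTGGTCGTTGCTGACCGAGCTAAA'

# The buggy check indexed seq instead of mySeq; this slice is a stop codon.
print(seq[9:12])  # TAG

def translate_from(seq, start):
    """Translate from start until a stop codon or an incomplete codon (hypothetical helper)."""
    translated = ''
    for n in range(start, len(seq), 3):
        codon = seq[n:n+3]
        # Stop on a trailing partial codon or on a stop codon.
        if len(codon) < 3 or codon_mapping[codon] == '*':
            break
        translated += codon_mapping[codon]
    return translated

print(translate_from(seq, 4))  # MASLLLRTDVAIDRWSLLTEL
```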
0
votes

You have to check whether the triplet is in the dictionary, because the last slice of a frame can be shorter than three bases:

for i in starts:
    mySeq=seq[i:]
    translated=''
    for n in range(0, len(mySeq), 3):
        subSeq = mySeq[n:n+3]
        print(subSeq)
        aAcid = codon_mapping.get(subSeq)
        if (not aAcid) or aAcid == '*': break
        translated += aAcid
    print("translated: " + translated)
    seqDictionary[i]=(translated)
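
To illustrate why .get helps here: a slice of one or two bases is not a key, so plain indexing raises KeyError while .get returns None. The sketch below uses a toy two-entry table (an assumption for brevity, not the full dictionary from the question).

```python
# Toy table for illustration only; the real codon_mapping has 64 entries.
toy_mapping = {'ATG': 'M', 'TAA': '*'}

# Slicing past the end of a string silently returns a shorter string.
partial = 'ATGAA'[3:6]
print(repr(partial))  # 'AA'

# .get returns None for a missing key instead of raising.
print(toy_mapping.get(partial))  # None

# Plain indexing raises KeyError for the same key.
try:
    toy_mapping[partial]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```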

With itertools, the translation can be written in one line:

import itertools
import re
#establish new dictionary
seqDictionary={}
#translate sequences
for m in re.finditer('ATG', seq):
    start = m.start()
    translated =''.join(itertools.takewhile(lambda aa: aa and aa != '*', (codon_mapping.get(seq[n:n+3]) for n in range(start, len(seq), 3)) ))
    print("translated: " + translated)
    seqDictionary[start] = translated
print(seqDictionary)
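
A possible variation, sketched under assumptions: if the range stops at len(seq) - 2, only complete codons are ever sliced, so plain indexing is safe and the None check is unnecessary. The function name translate_all is hypothetical, and the codon table is rebuilt compactly on the assumption that it matches the question's codon_mapping.

```python
import re
from itertools import product, takewhile

# Assumption: the standard genetic code, meant to match the question's codon_mapping.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_mapping = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

seq = 'ATGTATGGCTAGCTTACTACTGCGCACTGATGTGGCTATCGATCGCTGGTCGTTGCTGACCGAGCTAAA'

def translate_all(seq):
    """Map each ATG start position to its translation (hypothetical helper)."""
    result = {}
    for m in re.finditer('ATG', seq):
        start = m.start()
        # Stopping at len(seq) - 2 guarantees every slice has three bases.
        codons = (seq[n:n+3] for n in range(start, len(seq) - 2, 3))
        amino_acids = (codon_mapping[c] for c in codons)
        # Collect amino acids until the first stop codon.
        result[start] = ''.join(takewhile(lambda aa: aa != '*', amino_acids))
    return result

print(translate_all(seq))
```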