DNA to Protein | translation incorrection

Question

Update: there was no error in my code after all. Always refresh the cache and local memory before re-running.

Resources for Verifying Translations:

[NCBI Protein Translation Tool][1] (Validation)

[Text Compare][2] (Verification)

[Solution Inspiration][3]

A 300-nucleotide DNA sequence should translate to 100 protein characters (one amino acid or stop symbol per codon).

# dna_sequence = above sequence

dna_codons = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
} # '*' is used instead of '_' as the stop-codon symbol

output_protein = ''
for i in range(0, len(dna_sequence), 3):
    codon = dna_sequence[i:i+3]
    output_protein += dna_codons.get(codon, '')
    
print(output_protein)

Example Problem & Solution: https://www.geeksforgeeks.org/dna-protein-python-3/

The first issue I’m seeing is that your code doesn’t look for a start codon, it incorrectly starts translating at the first nucleotide. Your NCBI solution seems to be doing the same though, which is surprising. (Another thing: it’s unnecessary to truncate your DNA sequence length to a multiple of 3; omitting this step won’t change the result). — Konrad Rudolph
Ok, interesting insights. I would like my solution to translate the same way as NCBI. I won't need start codons, as really I'm translating entire sequences. — StressedBio2
Anyway, I can’t reproduce the issue. Your code contains several errors which prevent it from executing (indentation, variable names not matching) but once these are fixed the output is identical to that from NCBI/EBI/…, except for _ instead of *. So I really don’t understand where your alleged output is coming from. — Konrad Rudolph
I'll have a look over my solution again and update this post asap — StressedBio2
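
The remark in the comments about truncation can be checked with a minimal sketch. It assumes the standard genetic code (NCBI translation table 1) and builds the same codon table as the question in a compact form. The helper name `translate` and the short 8-base input are made up for illustration and are not part of the original post.

```python
from itertools import product

# Standard genetic code (NCBI table 1), listed in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first nucleotide, skipping any unknown or partial codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "") for i in range(0, len(dna), 3)
    )


# Hypothetical 8-base input: two full codons plus a 2-base remainder.
untrimmed = "ATGGCCTA"
trimmed = untrimmed[:len(untrimmed) - len(untrimmed) % 3]

print(translate(untrimmed))  # MA
print(translate(trimmed))    # MA
```

Because the lookup falls back to an empty string for a codon it does not know, the leftover `TA` contributes nothing, so trimming beforehand does not change the result.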

David Parry · Accepted Answer · 2021-03-31T09:34:59

I think the issue is that you are mixing up variable names: your translation code appends to protein, but you print output_protein, which I assume is created somewhere else in your code. Likewise, you first edit the variable dna_sequence but then iterate over dna, which I assume is also defined elsewhere and may not match dna_sequence.

After editing the variable names I can use your code to get the same translation as the NCBI tool.

dna_sequence = 'TTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACG'
# Optional: trim trailing bases so the length is a multiple of 3
# (a partial final codon is skipped by the lookup below anyway).
dna_sequence = dna_sequence[:len(dna_sequence) - len(dna_sequence) % 3]

#%% 
dna_codons = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                 
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
}

output_protein = ""
for i in range(0, len(dna_sequence), 3):
    codon = dna_sequence[i:i+3]
    output_protein += dna_codons.get(codon,'')

print(output_protein)

ncbi = 'L*ICSLNEL*NLCGCHSAACLVHSRSIINN*LLSLTGHE*LVYLLQALTVSSVLQPIISTSRFCPGVTER*DGEPCPWFQRENTRPTQFACFTGSRRART'
if ncbi == output_protein:
    print("Matches")
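
When the comparison fails, it helps to know where the two strings diverge, which is what the Text Compare step in the question does by eye. A minimal sketch, assuming a hypothetical helper named `first_mismatch` that is not part of the original answer:

```python
def first_mismatch(expected, actual):
    """Return the index of the first differing character, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    # One string is a prefix of the other: they differ where the shorter ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Hypothetical example: '_' versus '*' as the stop symbol.
print(first_mismatch("L*ICS", "L_ICS"))  # 1
print(first_mismatch("L*ICS", "L*ICS"))  # None
```

Called as `first_mismatch(ncbi, output_protein)`, it would point at the first position worth inspecting, for example a stop symbol written as `_` instead of `*`.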
