Limiting re.findall() to record values before a certain number. Python

Question

sequence_list = ['atgttttgatggATGTTTGATTAG','atggggtagatggggATGGGGTGA','atgaaataatggggATGAAATAA']

I take each element (fdna) from sequence_list and search for sequences that start with ATG and are then read three bases at a time until a TAA, TGA, or TAG is reached.

Each element in sequence_list is made up of two sequences: the first is lowercase and the second is uppercase, so each string has the form lowercase + UPPERCASE.

Gathering CDS Starts & Upper()

cds_start_positions = []
cds_start_positions.append(re.search("[A-Z]", fdna).start())
fdna = fdna.upper()

So after I find where the uppercase sequence starts, I record the index number in cds_start_positions and then convert the entire string (fdna) to uppercase
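As a minimal sketch of this step applied to the whole list (Python 3 syntax; the name upper_sequences and the fallback to len(fdna) when a string has no uppercase letter are assumptions added for illustration, not part of the original code):

```python
import re

sequence_list = ['atgttttgatggATGTTTGATTAG',
                 'atggggtagatggggATGGGGTGA',
                 'atgaaataatggggATGAAATAA']

cds_start_positions = []
upper_sequences = []  # hypothetical name: the uppercased copies of each fdna

for fdna in sequence_list:
    first_upper = re.search("[A-Z]", fdna)
    # Assumption: if there is no uppercase letter, treat the whole string as lowercase.
    cds_start_positions.append(first_upper.start() if first_upper else len(fdna))
    upper_sequences.append(fdna.upper())

print(cds_start_positions)  # [12, 15, 14]
```

Recording the index before calling upper() matters, because the case information is lost afterwards.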

This statement gathers all ATG-xxx-xxx- that are followed by either a TAA|TAG|TGA

Gathering uORFs

ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)

So what I'm trying to do is gather all occurrences where ATG-xxx-xxx is followed by either TAA, TGA, or TAG.

My input data is composed of 2 sequences (lowercaseUPPERCASE) and I want to find these sequences when:

1: both the ATG and the TAA|TGA|TAG that follows it are in the lowercase portion (which has since been uppercased, but the index where the original uppercase portion begins is stored in cds_start_positions)

2: the ATG is in the lowercase portion (before the cds_start_positions value) and the next TAA|TGA|TAG following it is in the uppercase portion.

NOTE: as it is set up now, an ATG that was in the original uppercase portion (at or after the cds_start_positions value) is also saved to the list.

What the "Gathering CDS Starts & Upper()" does is find where the upper case sequence starts.

Is there any way to put constraints on the "Gathering uORFs" part so that it only recognizes an ATG at a position before the corresponding element in the list cds_start_positions?

I want to put a condition in the ORF_sequences line so that it only finds 'ATG' before each element in the list 'cds_start_positions'.

Example of what cds_start_positions would look like

cds_start_positions = [12, 15, 14] #where each value indicates where the uppercase portion starts in the sequence_list elements (fdna)

For the first sequence in sequence_list I would want this result:

#input
fdna = 'atgttttgatggATGTTTGATTAG'
#what i want for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA']
#what i'm getting for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA','ATGTTTGATTAG']

That 3rd entry starts at index 12 (the corresponding value in the list cds_start_positions), which is not before it, and I don't want that. However, the 2nd entry has its starting ATG before index 12 and its TAA|TGA|TAG after index 12, which should be allowed.

Note: I have another line of code that just takes the start positions of where these ATG-xxx-xxx-TAA|TGA|TAG occur, and that is:

start_positions = [i for i in start_positions if i < k-1]

Is there a way to use this principle in re.findall ?

let me know if i need to clarify anything

It would be helpful if you gave the smallest example of what your problem is, showing what you have and what you want and what won't work. I find it confusing with so much problem domain specific language being used. It seems you've got a list of strings and you want something from them and your findall seemed to give back values but I don't know what the problem is with them. — sotapme
So, basically you want lowercase sequence and a sequence with both, am I right? — ATOzTOA
@greg When I run the script it takes ATG-xxx-xxx-TAA|TGA|TAG sequences from the combined lowercase+uppercase list. Even sequences that start AFTER the corresponding value in the cds_start_position (the number) — O.rka
@sotapme thanks, so the smallest example (i forgot to include it but will add it in now) is that i'm getting the ATG-xxx-xxx-TAA|TGA|TAG sequence added to my ORF_sequence list when it is after the cds_start_position value. — O.rka

eyquem · Accepted Answer · 2013-02-14T20:22:09

Yesterday I wrote a first answer.

Then I read ATOzTOA's answer, which had a very good idea: using a positive look-behind assertion.
I thought that my answer was completely off and that his idea was the right way to go.
But afterwards I realized that there is a flaw in ATOzTOA's code.

Say the examined string contains the portion 'ATGxzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA': the consuming part of the pattern matches 'xzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA' and the look-behind assertion matches the preceding 'ATG', so the portion constitutes a match; that's OK.
But it means that right after this match the regex engine is positioned at the end of the portion 'xzyxyzyyxyATGxzxzyxyzxxzzxzyzyxyzTGA'.
So when the regex engine searches for the next match, it won't find a match beginning at the second 'ATG' inside this portion, since it resumes from a position well past it.

.

So the only way to achieve what the question requires is in fact the first algorithm I had written, so I am reposting it.

The job is done by the function find_ORF_seq().

If you pass True as the second argument (the parameter messages) of find_ORF_seq(), it prints messages that help to understand the algorithm.
If not, the parameter messages takes the default value None.

The pattern is written '(atg).+?(?:TAA|TGA|TAG)' with some letters lowercase and the others uppercase, but that is not why the portions are caught correctly with respect to case. As you will see, the flag re.IGNORECASE is used: this flag is necessary because the part matched by (?:TAA|TGA|TAG) can fall in the lowercase part as well as in the uppercase part.

The essence of the algorithm lies in the while loop, which is necessary because the portions searched for may overlap, as I explained above (provided I understood correctly and the samples and explanations you gave are correct).
Since findall() and finditer() only return non-overlapping matches, they cannot be used directly with this pattern, so I use a loop.
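As a small check of the non-overlapping behaviour (a sketch in Python 3 syntax, not part of the original answer), the snippet below runs findall() with this answer's pattern on the test sequence used in the comparison further down. The first group is made non-capturing here so that findall() returns whole matches; that change is an assumption of the sketch.

```python
import re

# Same pattern as in this answer, but with a non-capturing first group so
# that findall() returns the whole match rather than only the captured 'atg'.
reg = re.compile('(?:atg).+?(?:TAA|TGA|TAG)', re.IGNORECASE)

fdna = 'atgggatggtagatggatgggATGGGGTGA'

# findall() resumes after the end of each match, so an 'atg' lying inside
# an earlier match can never start a match of its own.
non_overlapping = [m.upper() for m in reg.findall(fdna)]
print(non_overlapping)  # ['ATGGGATGGTAG', 'ATGGATGGGATGGGGTGA']
```

Only two portions come back, while the loop-based function reports four for the same sequence; the two missing ones start inside portions that were already matched.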

To avoid iterating over the fdna sequence one base at a time, I use the ma.start() method, which gives the position where a match ma begins in the searched string, and I update s with s = s + p + 1 (the +1 so that the next search does not begin again at the start of the match just found).

My algorithm doesn't need the information in start_positions because I don't use a look-behind assertion but a real match on the first three letters: a match is rejected when it starts in the uppercase part, that is to say when ma.group(1), which captures the first three bases (either 'ATG' or 'atg', since the regex ignores case), is equal to 'ATG'.

I had to use s = s + p + 1 instead of s = s + p + 3 because it seems that the portions you are looking for are not spaced by multiples of three bases.
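For comparison, here is a sketch of a variant that does use the stored start position, as the question asked. It is not the method of this answer: it assumes the question's codon-aligned pattern (reading three bases at a time) and wraps it in a zero-width lookahead with a capturing group, so that finditer() tests every position and can report overlapping portions. The names ORF_LOOKAHEAD and find_uorfs are hypothetical, and the syntax is Python 3.

```python
import re

# Hypothetical name. The lookahead consumes no characters, so finditer()
# tries every position and overlapping portions are all reported; the
# capturing group holds the portion itself.
ORF_LOOKAHEAD = re.compile(r'(?=(ATG(?:...)*?(?:TAA|TAG|TGA)))')

def find_uorfs(fdna):
    """Return in-frame ATG...stop portions whose ATG starts before the uppercase part."""
    first_upper = re.search('[A-Z]', fdna)
    # Assumption: with no uppercase letter, every position counts as "before".
    cds_start = first_upper.start() if first_upper else len(fdna)
    return [m.group(1)
            for m in ORF_LOOKAHEAD.finditer(fdna.upper())
            if m.start() < cds_start]

print(find_uorfs('atgttttgatggATGTTTGATTAG'))  # ['ATGTTTTGA', 'ATGGATGTTTGA']
```

Because this sketch reads in steps of three, its results can differ from those of find_ORF_seq(), whose pattern accepts a stop at any offset.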

import re

sequence_list = ['atgttttgatgATGTTTTGATTT',
                 'atggggtagatggggATGGGGTGA',
                 'atgaaataatggggATGAAATAA',
                 'aaggtacttctcggctaACTTTTTCCAAGT']

pat = '(atg).+?(?:TAA|TGA|TAG)'
reg = re.compile(pat, re.IGNORECASE)

def find_ORF_seq(fdna, messages=None, s=0, reg=reg):
    ORF_sequences = []
    if messages:
        print('s before == ', s)
    while True:
        if messages:
            print('---------------------------\n'
                  's == %d\n'
                  'fdna[%d:] == %r' % (s, s, fdna[s:]))
        ma = reg.search(fdna[s:])
        if messages:
            print('reg.search(fdna[%d:]) == %r' % (s, ma))
        if ma:
            if messages:
                print('ma.group() == %r\n'
                      'ma.group(1) == %r'
                      % (ma.group(), ma.group(1)))
            if ma.group(1) == 'ATG':
                # the match starts in the uppercase part: stop searching
                if messages:
                    print("ma.group(1) is uppercased 'ATG' then I break")
                break
            else:
                ORF_sequences.append(ma.group().upper())
                p = ma.start()
                if messages:
                    print(' The match is at position p == %d in fdna[%d:]\n'
                          ' and at position s + p == %d + %d == %d in fdna\n'
                          ' then I put s = s + p + 1 == %d'
                          % (p, s, s, p, s+p, s+p+1))
                s = s + p + 1
        else:
            break
    if messages:
        print('\n==== RESULT ======\n')
    return ORF_sequences

for fdna in sequence_list:
    print('\n============================================')
    print('fdna == %s\n'
          'ORF_sequences == %r'
          % (fdna, find_ORF_seq(fdna, True)))

###############################

print('\n\n\n######################\n\ninput sample')
fdna = 'atgttttgatggATGTTTGATTTATTTTAG'
print('  fdna == %s' % fdna)
print('  **atgttttga**tggATGTTTGATTTATTTTAG')
print('  atgttttg**atggATGTTTGA**TTTATTTTAG')
print('output sample')
print("  ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA']")

print('\nfind_ORF_seq(fdna) ==', find_ORF_seq(fdna))

.

The same function without the print instructions to better see the algorithm.

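import re

pat = '(atg).+?(?:TAA|TGA|TAG)'
reg = re.compile(pat, re.IGNORECASE)

def find_ORF_seq(fdna, messages=None, s=0, reg=reg):
    # messages is unused here; it is kept so the signature matches the verbose version
    ORF_sequences = []
    while True:
        ma = reg.search(fdna[s:])
        if ma:
            if ma.group(1) == 'ATG':
                # the match starts in the uppercase part: stop searching
                break
            else:
                ORF_sequences.append(ma.group().upper())
                # resume one base after the start of this match
                s = s + ma.start() + 1
        else:
            break
    return ORF_sequences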

.

I compared the two functions, ATOzTOA's and mine, on an fdna sequence that reveals the flaw. The result confirms what I described.

from find_ORF_sequences import find_ORF_seq
from ATOz_get_sequences import getSequences

fdna = 'atgggatggtagatggatgggATGGGGTGA'

print('fdna == %s' % fdna)
print('find_ORF_seq(fdna)')
print(find_ORF_seq(fdna))
print('getSequences(fdna)')
print(getSequences(fdna))

result

fdna == atgggatggtagatggatgggATGGGGTGA
find_ORF_seq(fdna)
['ATGGGATGGTAG', 'ATGGTAG', 'ATGGATGGGATGGGGTGA', 'ATGGGATGGGGTGA']
getSequences(fdna)
['ATGGGATGGTAG', 'ATGGATGGGATGGGGTGA']

.

But then again, I wonder:
do you want the matches that are inner parts of another match, like 'ATGGGATGGGGTGA' at the end of 'ATGGATGGGATGGGGTGA'?

If not, ATOzTOA's answer will also fit.
