sequence_list = ['atgttttgatggATGTTTGATTAG','atggggtagatggggATGGGGTGA','atgaaataatggggATGAAATAA']
I take each element (fdna) from sequence_list and search for sequences that start with ATG and are then read three bases at a time until a TAA, TGA, or TAG is reached.
Each element in sequence_list is made up of two sequences: the first is lowercase and the second is uppercase, so each string has the form lowercase + UPPERCASE.
Gathering CDS Starts & Upper()
cds_start_positions = []
cds_start_positions.append(re.search("[A-Z]", fdna).start())
fdna = fdna.upper()
So after I find where the uppercase sequence starts, I record the index number in cds_start_positions and then convert the entire string (fdna) to uppercase
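As a minimal sketch of this step applied to the whole list (the loop and the name upper_sequences are assumptions added for illustration, not part of the original code):

```python
import re

sequence_list = ['atgttttgatggATGTTTGATTAG',
                 'atggggtagatggggATGGGGTGA',
                 'atgaaataatggggATGAAATAA']

cds_start_positions = []
upper_sequences = []  # hypothetical name: uppercased copy of each fdna

for fdna in sequence_list:
    # Index of the first uppercase letter, i.e. where the CDS portion begins.
    # Assumes every element contains at least one uppercase letter.
    cds_start_positions.append(re.search("[A-Z]", fdna).start())
    upper_sequences.append(fdna.upper())

print(cds_start_positions)  # [12, 15, 14]
```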
The following statement gathers every ATG-xxx-xxx- run that ends in TAA, TAG, or TGA:
Gathering uORFs
ORF_sequences = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
So what I'm trying to do is gather all occurrences where ATG-xxx-xxx is followed by either TAA, TGA, or TAG.
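To illustrate how the pattern reads in frame, here is a small sketch on a made-up sequence (the example string is an assumption chosen for illustration). The lazy (?:...)*? consumes one codon at a time and stops at the first in-frame stop codon, so an out-of-frame TAA does not end the match.

```python
import re

orf_pattern = r'ATG(?:...)*?(?:TAA|TAG|TGA)'

# Made-up example: the codons are ATG ATA AGG TGA.
# The 'TAA' starting at index 4 is out of frame, so it is skipped.
example = 'ATGATAAGGTGA'
print(re.findall(orf_pattern, example))  # ['ATGATAAGGTGA']
```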
My input data is composed of 2 sequences (lowercaseUPPERCASE), and I want to find these sequences when:
1: the ATG and the TAA|TGA|TAG that follows it are both in the lowercase portion (which is now uppercase, but the index where the original uppercase portion begins is stored in cds_start_positions)
2: the ATG is in the lowercase portion (at an index less than the cds_start_positions value) and the next TAA|TGA|TAG that follows it is in the uppercase portion.
NOTE: the way it is set up now, an ATG that was in the original uppercase portion (at an index greater than or equal to the cds_start_positions value) is also saved to the list.
What the "Gathering CDS Starts & Upper()" step does is find where the uppercase sequence starts.
Is there any way to put constraints on the "Gathering uORFs" part so that it only recognizes an ATG at a position before the corresponding element in the list cds_start_positions?
I want to put a condition in the ORF_sequences line so that it only finds 'ATG' before each element in the list 'cds_start_positions'.
Example of what cds_start_positions would look like
cds_start_positions = [12, 15, 14] #where each value indicates where the uppercase portion starts in the sequence_list elements (fdna)
For the first sequence in sequence_list I would want this result:
#input
fdna = 'atgttttgatggATGTTTGATTAG'
#what i want for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA']
#what i'm getting for the output
ORF_sequences = ['ATGTTTTGA','ATGGATGTTTGA','ATGTTTGATTAG']
That 3rd entry starts at index 12, which is not before the corresponding value in cds_start_positions (12), and I don't want that. However, the 2nd entry has its starting ATG before index 12 and its TAA|TGA|TAG after index 12, which should be allowed.
***Note: I have another line of code that just takes the start positions of where these ATG-xxx-xxx-TAA|TGA|TAG occur, and that is:
start_positions = [i for i in start_positions if i < k-1]
Is there a way to use this principle in re.findall?
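One possible way to apply the same position filter to the matches themselves, offered as a sketch rather than a definitive answer: wrap the pattern in a lookahead so that overlapping candidates are reported (re.findall and re.finditer otherwise return only non-overlapping matches), then keep the matches whose start index is below the CDS start. The names ORF_LOOKAHEAD and find_uorfs are hypothetical.

```python
import re

# The capture group sits inside a zero-width lookahead, so the scan advances
# one character at a time and overlapping ORFs can all be reported.
ORF_LOOKAHEAD = re.compile(r'(?=(ATG(?:...)*?(?:TAA|TAG|TGA)))')

def find_uorfs(fdna):
    """Return ORFs whose ATG lies in the lowercase part (hypothetical helper)."""
    # Assumes fdna contains at least one uppercase letter.
    cds_start = re.search("[A-Z]", fdna).start()
    upper = fdna.upper()
    return [m.group(1)
            for m in ORF_LOOKAHEAD.finditer(upper)
            if m.start() < cds_start]  # the ATG must begin before the CDS

print(find_uorfs('atgttttgatggATGTTTGATTAG'))
# ['ATGTTTTGA', 'ATGGATGTTTGA']
```

Only the position of the ATG is checked here, so the stop codon may fall on either side of the CDS start.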
Let me know if I need to clarify anything.
findall seemed to give back values, but I don't know what the problem is with them. - sotapme