I have a GenBank file (.gbk) from which I want to extract certain genes. My problem is the following: to process the file, the header of each locus must be in a specific format, and the headers in my file are not. I want to parse the file and replace the headers. Currently the file looks like this:
LOCUS NODE_1_length_393688_cov_17.8554393688 bp DNA linear
BCT22-MAY-2017
DEFINITION Escherichia coli strain strain.
ACCESSION
VERSION
KEYWORDS .
SOURCE Escherichia coli
ORGANISM Escherichia coli
Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
Escherichia.
....
>>Gene data here
....
LOCUS NODE_2_length_278889_cov_17.85545278889 bp DNA linear
BCT22-MAY-2017
DEFINITION Escherichia coli strain strain.
ACCESSION
VERSION
KEYWORDS .
SOURCE Escherichia coli
ORGANISM Escherichia coli
Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
Escherichia.
....
>>Gene data here
....
LOCUS NODE_3_length_340008_cov_17.855432340008 bp DNA linear
BCT22-MAY-2017
DEFINITION Escherichia coli strain strain.
ACCESSION
VERSION
KEYWORDS .
SOURCE Escherichia coli
ORGANISM Escherichia coli
Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
Escherichia.
....
>>Gene data here
....
The string beginning with NODE is too long for the file format convention and needs to be shortened so that the headers look like this:
LOCUS NODE_1_393688 bp DNA linear
....
LOCUS NODE_2_278889 bp DNA linear
....
LOCUS NODE_3_340008 bp DNA linear
The part that needs to be cut out is not necessarily of the same length in every header, so a fixed approach that removes everything between certain positions of the string is not feasible. I have tried different approaches using re.compile() and re.sub(), but have not been successful so far.
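A minimal sketch of one possible pattern-based substitution for the headers above. It assumes that every affected name has the form NODE_<n>_length_<len>_cov_<value>, that the number after _length_ is the one to keep, and that everything from _cov_ up to the next whitespace can be dropped. The names shorten_locus_line and rewrite_gbk are hypothetical, chosen only for illustration.

```python
import re

# Assumed name layout: NODE_<n>_length_<len>_cov_<value>.
# \S+ consumes everything after "_cov_" up to the next whitespace,
# however long that run of characters is.
LOCUS_NAME = re.compile(r"(NODE_\d+)_length_(\d+)_cov_\S+")


def shorten_locus_line(line):
    """Shorten the NODE name on a LOCUS line; return other lines unchanged."""
    if not line.startswith("LOCUS"):
        return line
    # Keep the node number and the length taken from the name itself.
    return LOCUS_NAME.sub(r"\1_\2", line, count=1)


def rewrite_gbk(src_path, dst_path):
    """Copy src_path to dst_path, rewriting only the LOCUS lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(shorten_locus_line(line))
```

Because the pattern is anchored on the _length_ and _cov_ markers rather than on character positions, it does not depend on how long the removed part is. Lines that do not start with LOCUS, and LOCUS lines whose name does not match the assumed layout, pass through unchanged.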
Any help would be highly appreciated. Thank you for your time!