I am trying to use awk to extract and print the first occurrence of NM_ and the portion after the NP_ starting with p. A : is printed instead of the "|" for each. The input file is tab-delimited, but the output does not need to be. The script below executes, but it prints all the lines in the file, not just the patterns. There may be multiple NM or NP entries in my actual data of over 5000 lines; however, only the first occurrence of each should be extracted and printed. I am still a little unclear on the RSTART and RLENGTH concepts, but using line 1 of the input as an example:

The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg

I have included comments as well. Thank you :).
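On RSTART and RLENGTH: both are set by awk's match(s, re) function. RSTART is the 1-based position where the match begins (0 if there is no match), and RLENGTH is the length of the matched text (-1 if there is no match), so substr(s, RSTART, RLENGTH) returns exactly the matched text. A minimal sketch using the HGVS string from line 1 of the input follows; the two regexes are assumptions about what an NM_ or NP_ entry looks like (everything up to the next "|"), not part of the script in this question.

```shell
line='NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg'

out=$(printf '%s\n' "$line" | awk '{
    # match() finds the leftmost match and sets RSTART and RLENGTH
    if (match($0, /NM_[^|]+/)) {
        print "RSTART=" RSTART, "RLENGTH=" RLENGTH
        print substr($0, RSTART, RLENGTH)      # the matched NM_ entry
    }
    if (match($0, /NP_[^|]+/)) {
        np = substr($0, RSTART, RLENGTH)       # the matched NP_ entry
        print substr(np, index(np, ":"))       # keep from the ":" onward
    }
}')
printf '%s\n' "$out"
```

Because match() always reports the leftmost match, this approach picks up the first NM_ and the first NP_ on a line without looping over fields.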
input
Input Variant HGVS description(s) Errors and warnings
rs41302905 NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg
rs8176745 NC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=
desired output
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
awk
awk -F'[\t|]' 'NR>1{ # set FS to tab or `|` to split each line into fields, and skip the header line
    r=$1; nm=np=""; # store $1 in r and reset nm and np to empty for each line
    for(i=2;i<=NF;i++) { # loop over the fields of the line, starting from field 2
        if (nm=="" && $i~/^NM_/) nm=$i; # keep only the first field that starts with NM_
        else if (np=="" && $i~/^NP_/) np=substr($i,index($i,":")); # keep the first NP_ field from the : onward (including :)
        if (nm!="" && np!="") { print r,nm np; break } # print the desired output once both are found
    }
}' input
There maybe multiple NM or NP in my actual data
Then show at least 2 in your sample data, otherwise you're inviting solutions to the wrong problem. - Ed Morton
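Following up on that comment, here is a rough sketch of how the multiple-entry case could be checked. The data row is hypothetical: the ID rs0000001 and the second NM_/NP_ entries (NM_000000.0, NP_000000.0) are made up purely for illustration, and the tab positions in the header are an assumption. The awk logic guards each variable so that only the first NM_ and the first NP_ on a line are kept.

```shell
out=$(
  {
    # Header line (skipped by NR > 1); column boundaries are assumed.
    printf 'Input Variant\tHGVS description(s)\tErrors and warnings\n'
    # Hypothetical row with two NM_ and two NP_ entries.
    printf 'rs0000001\tNC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NM_000000.0:c.1A>G|NP_065202.2:p.Gly268Arg|NP_000000.0:p.Met1Val\n'
  } | awk -F'[\t|]' 'NR > 1 {
      nm = np = ""
      for (i = 2; i <= NF; i++) {
          # Only assign while the variable is still empty, so later
          # NM_/NP_ fields cannot overwrite the first occurrence.
          if (nm == "" && $i ~ /^NM_/) nm = $i
          else if (np == "" && $i ~ /^NP_/) np = substr($i, index($i, ":"))
      }
      if (nm != "" && np != "") print $1, nm np
  }'
)
printf '%s\n' "$out"
```

On this made-up row the sketch prints the first pair only, rs0000001 NM_020469.2:c.802G>A:p.Gly268Arg, and ignores the later NM_000000.0 and NP_000000.0 entries.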