3
votes

I am trying to use awk to extract and print the first occurrence of NM_ and the portion after the NP_ that starts with p. A : is printed instead of the "|" for each. The input file is tab-delimited, but the output does not need to be. The code below does execute, but it prints all the lines in the file, not just the patterns. There may be multiple NM or NP entries in my actual data of over 5000 lines; however, only the first occurrence of each should be extracted and printed. I am still a little unclear on the RSTART and RLENGTH concepts, but using line 1 of the input as an example:

The NM variable would be NM_020469.2

The NP variable would be :p.Gly268Arg

I have included comments as well. Thank you :).

input

Input Variant   HGVS description(s) Errors and warnings
rs41302905  NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg
rs8176745   NC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=

desired output

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

awk

awk -F'[\t|]' 'NR>1{ # set FS to tab or `|` to split each line into fields, and skip the header line
              r=$1; nm=np="";  # save $1 in r and reset nm and np to empty for each line
              for(i=2;i<=NF;i++) { # loop over the fields, starting from field 2
                  if ($i~/^NM_/) nm=$i;  # field starts with NM_: store the whole field in nm
                  else if ($i~/^NP_/) np=substr($i,index($i,":")); # field starts with NP_: store the part from : onward (including :) in np
                  if (nm && np) { print r,nm np; break }  # once both are set, print the desired output and stop the loop
              }
          }' input
5
Still talking about "patterns", huh? Sigh.... Also, if "there may be multiple NM or NP in my actual data", then show at least 2 in your sample data; otherwise you're inviting solutions to the wrong problem. - Ed Morton

5 Answers

1
votes

Awk solution:

awk -F'[\t|]' 'NR>1{
                  r=$1; nm=np="";
                  for(i=2;i<=NF;i++) {
                      if ($i~/^NM_/) nm=$i;
                      else if ($i~/^NP_/) np=substr($i,index($i,":"));
                      if (nm && np) { print r,nm np; break } 
                  }
              }' input

  • NR>1 - start processing from the 2nd record (skips the header line)

  • r=$1; nm=np="" - initialization of the needed variables

  • for(i=2;i<=NF;i++) - iterating through the fields (starting from the 2nd)

  • if ($i~/^NM_/) nm=$i - capturing the NM_... item into variable nm

  • else if ($i~/^NP_/) np=substr($i,index($i,":")) - capturing the NP_... item into variable np (starting from : till the end)

  • if (nm && np) { print r,nm np; break } - if both items have been captured, print them and break the loop to avoid further processing


The output:

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
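
If a line can hold several NM_ or NP_ entries, as the question mentions, the loop above keeps overwriting nm until an NP_ field has been seen, so it may report a later NM_ rather than the first. A minimal sketch of one way to keep only the first of each, assuming a hypothetical input line with two NM_ and two NP_ entries (the accession numbers below are made up for illustration):

```shell
# Hypothetical line: two NM_ and two NP_ entries (made-up accessions).
first_only=$(printf 'rs1\tNC_1.1:g.1A>T|NM_000001.1:c.1A>T|NM_000002.1:c.2A>T|NP_000001.1:p.Met1Leu|NP_000002.1:p.Met2Leu\n' |
awk -F'[\t|]' '{
    nm = np = ""
    for (i = 2; i <= NF; i++) {
        # Assign only while the variable is still empty, so the first hit wins.
        if (nm == "" && $i ~ /^NM_/) nm = $i
        else if (np == "" && $i ~ /^NP_/) np = substr($i, index($i, ":"))
        if (nm != "" && np != "") { print $1, nm np; break }
    }
}')
printf '%s\n' "$first_only"
```

The sketch omits NR>1 only because its single test line has no header; with a real file the header check would stay.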
1
votes

Could you please try the following and let me know if this helps too.

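awk '{
  nm=np="";                                                   # reset for each line
  if(match($0,/NM_[^|\t]*/)) nm=substr($0,RSTART,RLENGTH);    # first NM_ entry, up to the next | or tab
  if(match($0,/NP_[^|\t]*/)) np=substr($0,RSTART,RLENGTH);    # first NP_ entry, up to the next | or tab
  split(np,a,":");                                            # a[2] holds the part after the colon, e.g. p.Gly268Arg
  if(nm && np){
    print $1,nm ":" a[2]
  }
}
'   Input_file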

Output will be as follows.

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

PS: Your sample Input_file doesn't appear to contain TABs. If your actual Input_file is TAB-delimited, you could add -F"\t" after awk, and if you want the output to be TAB-delimited too, add OFS="\t" before Input_file.
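
Since the question asks about RSTART and RLENGTH: match() returns the 1-based position where the regex first matches (0 if it does not match) and, as a side effect, sets RSTART to that position and RLENGTH to the length of the matched text (-1 if there is no match). substr($0,RSTART,RLENGTH) therefore cuts out exactly the matched text. A small sketch using the HGVS column of line 1 from the question (the variable names are only illustrative):

```shell
line='NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg'
rstart_demo=$(printf '%s\n' "$line" | awk '{
    # match() sets RSTART (where the match begins) and RLENGTH (how long it is).
    if (match($0, /NM_[^|]*/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')
printf '%s\n' "$rstart_demo"
```

Here the NM_ entry starts at character 29 and is 20 characters long, so this prints 29 20 NM_020469.2:c.802G>A.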

1
votes

Short GNU awk solution (with match function):

awk 'match($0,/(NM_[^|]+).*NP_[^:]+([^[:space:]|]+)/,a){ print $1,a[1] a[2] }' input

The output:

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
1
votes

Given your posted sample input, this is all you need to produce your desired output:

$ awk -F'[\t|]+' 'NR>1{sub(/[^:]+/,"",$4); print $1, $3 $4}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

If that's not all you need then provide more truly representative input/output.
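
To see why $3 and $4 are the fields used above, here is a rough sketch that prints each field with its number when the line is split on runs of tabs and |. It assumes the layout of the posted sample (one NC_, one NM_ and one NP_ entry per line); the shell variable name is only illustrative:

```shell
fields=$(printf 'rs41302905\tNC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg\n' |
awk -F'[\t|]+' '{
    # Print every field with its index to show how the line is split.
    for (i = 1; i <= NF; i++) print i "=" $i
}')
printf '%s\n' "$fields"
```

With this layout, field 3 is the NM_ entry and field 4 is the NP_ entry, which is why the sub() call strips everything before the first : in $4.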

1
votes

Another alternative awk proposal.

awk 'NR>1{sub(/\|/," "); sub(/\|NP_[^:]*/,""); print $1,$3}' file

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=