Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.

Important note #1: The lines being printed to output include the line being searched and all the lines following it until the next line with a ">".

Example: When fields 1 and 4 of File1 are 2776 and 2968 respectively, they should be searched against field 4 of File2 to find the match 2776-2968(+) (because both numbers from File1 appear as substrings in field 4 of File2). The order of the numbers in the string does not matter: 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed along with all lines below it until another line with ">" is encountered.

Important note #2: File1 is tab-delimited: \t. File2 is colon-delimited: :.


File1:

Transcription_Start     Translation_Start       Translation_Stop        Transcription_Stop      Strand  Expression
2776                    2968    +       920
17374                   17563   +       1959
2968                    2786    -       802
17563                   17375   -       1694
19606                   19395   -       1914

File2:

>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA

Desired Output:

>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC

This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):

$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA

How can I process my files with awk (or similar) to achieve my goal?

1 Answer

With your shown samples, please try the following. Written and tested with GNU awk.

awk '
FNR==NR{
  arr[$1,$2]
  next
}
/^>/{
  found=""
  if((($5,$6) in arr) || (($6,$5) in arr)){
    found=1
  }
}
found
' file1 FS=":|-|\\\\("  file2
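To see why the coordinates end up in $5 and $6, it can help to print how one header line splits under that field separator. The sketch below is only an illustration, not part of the solution: it passes the same regex through -F inside single quotes (so two backslashes are enough, whereas the double-quoted command-line assignment above needs four), and the variable name fields is made up for the demo.

```shell
# Show each field of one File2 header line as awk sees it with FS = ":" or "-" or "(".
fields=$(printf '%s\n' '>-::NC_013316.1:2776-2968(+)' |
  awk -F':|-|\\(' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

For this header the listing includes $5=[2776] and $6=[2968]. Note that the "-" right after ">" is itself a separator, so a header with a different prefix (such as the antisense ones in the sample) would place its coordinates in other field positions; that is one reason the answer is qualified with "with your shown samples".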

Explanation: A detailed, line-by-line explanation of the above.

awk '                             ##Start of the awk program.
FNR==NR{                          ##This condition is TRUE only while file1 is being read.
  arr[$1,$2]                      ##Create an entry in arr keyed by the 1st and 2nd whitespace-separated fields.
  next                            ##Skip all further statements for this line.
}
/^>/{                             ##If the line starts with ">", do the following.
  found=""                        ##Reset found.
  if((($5,$6) in arr) || (($6,$5) in arr)){  ##If the 5th and 6th fields, in either order, form a key in arr, do the following.
    found=1                       ##Set found to 1.
  }
}
found                             ##If found is set, print the current line.
' file1 FS=":|-|\\\\("  file2     ##Name the input files; FS is set just before file2 so that its header lines split into the values to match.
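As a possible variation, here is a minimal sketch that does not depend on the header prefix: it reads File1 as strictly tab-delimited (using $1 and $4, as the question describes) and pulls the coordinates out of the last colon-separated piece of each header with split(). The temporary directory, the shortened sequences, and the names pairs, part, c, keep and result are all assumptions made for the demo, not part of the original answer.

```shell
# Sketch only: the temp directory and the sample files are stand-ins for real data.
tmp=$(mktemp -d)
printf '2776\t\t\t2968\t+\t920\n2968\t\t\t2786\t-\t802\n' > "$tmp/File1"
printf '%s\n' \
  '>-::NC_013316.1:2776-2968(+)' 'ATTGAACG' 'GCTCGGAA' \
  '>-::NC_013316.1:2786-2968(-)' 'GTTCCTCC' \
  '>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)' 'TATTTCTT' > "$tmp/File2"

result=$(awk -F'\t' '
FNR == NR { pairs[$1, $4]; next }   # File1: remember each ($1, $4) pair
/^>/ {
  n = split($0, part, ":")          # the coordinates are the last colon-separated piece
  split(part[n], c, /[-(]/)         # "2776-2968(+)" -> c[1]="2776", c[2]="2968"
  keep = ((c[1], c[2]) in pairs) || ((c[2], c[1]) in pairs)
}
keep                                # print the header and its sequence lines while keep is true
' "$tmp/File1" "$tmp/File2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With this stand-in data, the two ">-" records are printed and the antisense record is skipped. The trade-off is that File1 must really contain tabs with empty second and third columns; if it is space-aligned instead, the default whitespace splitting used in the answer above is the safer choice.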