0
votes

I thought I understand how to merge files. Could you help me tell, where is the problem in my script?

INPUT:

file1

 c.001A>C   11  p.His103    A
 c.000A>C   12  p.Thr102    E
 c.111A>C   12  p.Thr102    D

file2

APC 5   112835056   ENST00000507379 13  c.001A>C    p.Val599Phe  
APC 5   112819143   ENST00000257430 9   c.1111G  p.G371X
APC 5   1128395514  ENST00000257430 15  c.001A>C     p.Glu1309AspfsT
APC 5   112838773   ENST00000257430 15  c.000A>C    p.Gln1062Ter

OUTPUT:

APC 5   112835056   ENST00000507379 13  c.001A>C    p.Val59   c.001A>C  11  p.His103    A
APC 5   1128395514  ENST00000257430 15  c.001A>C    p.Glu1    c.001A>C  11  p.His103    A
APC 5   112838773   ENST00000257430 15  c.000A>C    p.Gln10   c.000A>C  12  p.Thr102    E

So far, the following awk line has been attempted:

awk -F'\t' -v OFS="\t" 'FNR==NR{a[$1]=$1FS$2FS$3FS$4; next} {if ($6 in a) print $0, a[$1]}' file1 file2

I would like to merge the file1 column1 with file2 column6 and print only match and all columns from both files. But it not working. Thank you

1
Could you please more elaborate 6th column's output more? How it is reduced? Please explain. - RavinderSingh13

1 Answers

2
votes

If you're ok to use GNU tools, you merge both files with join, sort and column commands:

$ join -1 1 -2 6 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 1.3 1.4" <(sort file1) <(sed 's/ \+/ /g' file2 | sort -k6) | column -t
APC  5  112838773   ENST00000257430  15  c.000A>C  p.Gln1062Ter     p.Thr102  E
APC  5  1128395514  ENST00000257430  15  c.001A>C  p.Glu1309AspfsT  p.His103  A
APC  5  112835056   ENST00000507379  13  c.001A>C  p.Val599Phe      p.His103  A

join command merge files based on first column of file1 and 6th column of file2. This command expects sorted input on both files which is done with <(...). The option -o lists all column you want to display.

Note the sed command removesduplicate white spaces in order to have the right column number for sort (on the 2nd file).

At last column -t is nicely displaying fields in a column style.