How to merge files based on common string (in different column number) using awk

Question

file1:

000001 c-2-3 p045 238744
000001 c-2-4 p042 439709
000002 c-2-4 p055 234744
000003 c-2-5 p099 956755
000004 c-2-9 p064 504435
000005 c-1-5 p043 384029
000006 c-2-2 p011 434444
000009 c-1-3 p083 035905

file2:

000001 1 0 0 rs333 HESN
000002 1 0 0 rs333 POS
000003 1 0 0 rs333 POS
000004 0 1 0 rs333 POS
000005 0 0 1 rs333 NEG
000008 1 0 0 rs333 POS

The following awk command:

awk 'NR==FNR {h[$1] = $0; next} {print $1,$2,$3,$4,h[$1]}' file2 file1 > file3

Yields the following file:

file3:

000001 c-2-3 p045 238744 000001 1 0 0 rs333 HESN
000001 c-2-4 p042 439709 000001 1 0 0 rs333 HESN
000002 c-2-4 p055 234744 000002 1 0 0 rs333 POS
000003 c-2-5 p099 956755 000003 1 0 0 rs333 POS
000004 c-2-9 p064 504435 000004 0 1 0 rs333 POS
000005 c-1-5 p043 384029 000005 0 0 1 rs333 NEG
000006 c-2-2 p011 434444
000009 c-1-3 p083 035905

However, file1 actually looks like this:

file1b:

c-2-3 p045 238744 000001
c-2-4 p042 439709 000001
c-2-4 p055 234744 000002
c-2-5 p099 956755 000003
c-2-9 p064 504435 000004
c-1-5 p043 384029 000005
c-2-2 p011 434444 000006
c-1-3 p083 035905 000009

How would I change the awk command to accept file1b (instead of file1) and get the same output (file3)? Also, how would I exclude the redundant information in file3 (i.e., column 5)?

Desired output using file1b and file2:

000001 c-2-3 p045 238744 1 0 0 rs333 HESN
000001 c-2-4 p042 439709 1 0 0 rs333 HESN
000002 c-2-4 p055 234744 1 0 0 rs333 POS
000003 c-2-5 p099 956755 1 0 0 rs333 POS
000004 c-2-9 p064 504435 0 1 0 rs333 POS
000005 c-1-5 p043 384029 0 0 1 rs333 NEG
000006 c-2-2 p011 434444
000009 c-1-3 p083 035905

Thanks!!

karakfa · Accepted Answer · 2017-03-23T00:46:26

awk to the rescue!

awk 'NR==FNR {k=$1; $1=""; a[k]=$0; next} 
             {k=$NF; NF--; print k,$0 a[k]}' file2 file1b 

000001 c-2-3 p045 238744 1 0 0 rs333 HESN
000001 c-2-4 p042 439709 1 0 0 rs333 HESN
000002 c-2-4 p055 234744 1 0 0 rs333 POS
000003 c-2-5 p099 956755 1 0 0 rs333 POS
000004 c-2-9 p064 504435 0 1 0 rs333 POS
000005 c-1-5 p043 384029 0 0 1 rs333 NEG
000006 c-2-2 p011 434444
000009 c-1-3 p083 035905

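A few subtle tricks keep the code short. In the first block, setting $1="" makes awk rebuild $0 with an empty first field, so the string stored in a[k] begins with a space. That is why the print statement concatenates $0 and a[k] directly, with no comma between them: matched lines get their separator from a[k], and unmatched lines (000006, 000009) get nothing appended, so there is no trailing space. In the second block, NF-- drops the last field (the key, saved in k beforehand) and rebuilds $0 from the remaining fields. This works in gawk, but some other awk implementations do not rebuild $0 when NF is decremented.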
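
If your awk does not rebuild $0 on NF--, the same merge can be written with explicit loops instead. The following is only a sketch under that assumption, not the code from the answer above: the variable names (extra, rest, line, merged) and the scratch directory are made up for illustration, and the sample data is copied from the question.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmpdir=$(mktemp -d)

cat > "$tmpdir/file1b" <<'EOF'
c-2-3 p045 238744 000001
c-2-4 p042 439709 000001
c-2-4 p055 234744 000002
c-2-5 p099 956755 000003
c-2-9 p064 504435 000004
c-1-5 p043 384029 000005
c-2-2 p011 434444 000006
c-1-3 p083 035905 000009
EOF

cat > "$tmpdir/file2" <<'EOF'
000001 1 0 0 rs333 HESN
000002 1 0 0 rs333 POS
000003 1 0 0 rs333 POS
000004 0 1 0 rs333 POS
000005 0 0 1 rs333 NEG
000008 1 0 0 rs333 POS
EOF

merged=$(awk '
    # First file (file2): keep columns 2..NF, indexed by the key in column 1.
    NR == FNR {
        rest = ""
        for (i = 2; i <= NF; i++) rest = rest OFS $i
        extra[$1] = rest          # begins with OFS, so no separator is needed later
        next
    }
    # Second file (file1b): the key is the last column; move it to the front.
    {
        line = $NF
        for (i = 1; i < NF; i++) line = line OFS $i
        print line extra[$NF]     # unmatched keys append an empty string
    }
' "$tmpdir/file2" "$tmpdir/file1b")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

The loops are more verbose than assigning to $1 and NF, but they never modify the record, so the result does not depend on how a particular awk rebuilds $0.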
