Compare two columns in different files and append data for the shared items - UNIX

Question

I have been searching in the forum a solution for my problem but unfortunately I was not successful. I am quite green in bash; therefore I would really appreciate any suggestion ot link to previous question/discussions

I have two files:

file1

    Ada     ada
    Ada     ada
    Ada     aidB
    Ada     alkA
    Ada     alkB
    Ada     alkB
    AdiY    adiA
    AdiY    gadA

file2

    NP_414557.1     insL1
    NP_414559.1     mokC
    YP_025292.1     hokC
    NP_414560.1     nhaA
    NP_414561.1     nhaR
    NP_414562.1     insB1
    NP_414563.1     insA
    NP_414564.1     rpsT

I would like to compare file1 and file2 based on the second column. If the object is the same I would like to append the text in the column 1 in file2 relative to that specific shared entry in a third column in file1.

Expected output: file1

    PhoB  caiT  NP_414580.1     
    PhoP  caiE  NP_414581.1     
    PhoP  caiB  NP_414582.1

Please consider the text reported just as example.

I am struggling with awk trying to adjust the code from a previouse question:

Compare two files of different columns and print different columns

awk 'NR==FNR{a[$1]=$2 ; next }$2 in a {$1=$1 FS a[$2]print}' file1 file2

Thank you very much in advance for your time and your help.

Best.

Your sample input looks a bit truncated. Where is caiT coming from? — fedorqui 'SO stop harming'
Your file1 and file2 appear to have absolutely nothing in common, and your proposed output has nothing in common with either of them. Consider posting some example data that at least has a minimal chance of demonstrating what you are asking about... — twalberg
Hi, thank you for your answer. As I wrote the data I reported are just example. The second column file1, file2, and the output have the same kind of information. Just imagine that the PhoB and caiT will be present in file1 and caiT and NP_414508.1 are present in file2. I hope this makes it clearer. — efrem

Josh Jolly Josh Jolly · Accepted Answer · 2014-05-07T11:15:45

As @fedorqui commented, your example inputs/output are not consistent. I think this should do the trick though:

awk 'NR==FNR{a[$2]=$0; next} a[$2]>0{print a[$2],$1}' file1 file2

file1:

A alice
B bob
C carol
D dan

file2:

1 dan
2 alice
3 carol
4 bob

Output:

$ awk 'NR==FNR{a[$2]=$0} NR>FNR && a[$2]>0{print a[$2],$1}' file1 file2
D dan 1
A alice 2
C carol 3
B bob 4

Output can be sorted by whatever column you choose using sort -k. Breaking down the awk code:

NR==FNR{a[$2]=$0; next} - NR is an awk variable which contains the total number of lines processed so far. FNR is similar, but only contains the number of processes lines of the current file, so this condition effectively means "only do this for the first input file". The associated action stores the entire line ($0) in the associative array a, with the index being the value of the second field of the row. next just means that awk should move to the next line without doing further processing.
a[$2]>0{print a[$2],$1} - this condition will only be tested for the second and subsequent input files. The second field is used as an index to look up a value from the a - if the value is greater than 0 (ie, the value is a string in this case) then the value is printed, followed by the first field of the current line.

Basically, every line of the first file is stored in an array, indexed by the second field of the line. If that second field is matched in the second file, then the whole line from the first file is printed, with the second field from the second file appended.

Compare two columns in different files and append data for the shared items - UNIX

2 Answers