5
votes

I'm trying to compare column 1 from file1 and column 3 from file 2, if they match then print the first column from file1 and the two first columns from file2.

here's a sample from each file:

file1

Cre01.g000100   
Cre01.g000500  
Cre01.g000650  

file2

chromosome_1    71569  |655|Cre01.g000500|protein_coding|CODING|PAC:26902937|1|1)
chromosome_1    93952  |765|Cre01.g000650|protein_coding|CODING|PAC:26903448|11|1)
chromosome_1    99034  |1027|Cre01.g000100 |protein_coding|CODING|PAC:26903318|9|1)

desired output

Cre01.g000100  chromosome_1    99034        
Cre01.g000500  chromosome_1    71569   
Cre01.g000650  chromosome_1    93952

I've been looking at various threads that are somewhat similar, but I can't seem to get it to print the columns from both files. Here are some links that are somewhat related:

awk compare 2 files, 2 fields different order in the file, print or merge match and non match lines

Obtain patterns from a file, compare to a column of another file, print matching lines, using awk

awk compare columns from two files, impute values of another column

Obtain patterns in one file from another using ack or awk or better way than grep?

Awk - combine the data from 2 files and print to 3rd file if keys matched

I feel like I should have been able to figure it out based on these threads, but it's been two days that I've been trying different variations of the codes and I haven't gotten anywhere. Here is some code that I've tried using on my files:

awk 'FNR==NR{a[$3]=$1;next;}{print $0 ($3 in a ? a[$3]:"NA")}' file1 file2

awk 'NR==FNR{ a[$1]; next} ($3 in a) {print $1 $2 a[$1]}' file1 file2

awk 'FNR==NR{a[$1]=$0; next}{print a[$1] $0}' file1 file2

I know i have to create a temp matrix that contains the first column of file1 (or the 3rd column of file2) then compare it to the other file. If there is a match, then print first column from file1 and column 1 and column 2 from file 2.

Thanks for the help!

2

2 Answers

2
votes

You can use this awk:

awk -F '[| ]+' -v OFS='\t' 'NR==FNR{a[$4]=$1 OFS $2; next}
       $1 in a{print $1, a[$1]}' file2 file1
Cre01.g000100   chromosome_1    99034
Cre01.g000500   chromosome_1    71569
Cre01.g000650   chromosome_1    93952
0
votes

Your middle attempt of the three is closest, but:

  • You haven't specified the field delimiter is |.
  • You don't assign to a[$1].
  • Your sample output is inconsistent with your desired output (the sample output shows column 1 from file 1 and column 1 from file 2; the desired output is reputedly column 1 from file 1 and columns 1 and 2 from file 2, though this interpretation depends on the interpretation of $3 in file 2 being the name between two pipe symbols).

    Citing the question at the time this answer was created:

    … compare column 1 from file1 and column 3 from file 2, if they match then print the first column from file1 and the two first columns from file2.

    desired output
    Cre01.g000100  chromosome_1    99034
    Cre01.g000500  chromosome_1    71569
    Cre01.g000650  chromosome_1    93952
    
  • We can observe that if $3 in file 2 is equal to a value from file 1, then it is as easy to print $3 as a saved value.

So, fixing this up:

awk -F'|' 'NR==FNR { a[$1]=1; next } ($3 in a) { print $3, $1 }' file1 file2

The key change is the assignment to a[$1] (and the -F'|'); the rest is cosmetic and can be tweaked to suit your requirements (since the question is self-inconsistent, it is hard to give a better answer).