4
votes

I have two txt files (tab delimited) I'm trying to compare with awk. I want to compare column 1 of file1 to column 5 of file2 and any time a string from file1 column 1 matches file2 column 5 I want to print (eventually to a new txt file) the entire row of file2. Theres about 14000 lines in file 1 with just one column and 40000 rows with 6 columns in file2. Below are some test files I've been using to try to figure out how to use awk to get my desired output.

file1

AAGAB
AAK1
AAMDC
AAMP
AAR2
AARD
AARS

file2

chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +
chrX    30233674    30238206    NM_002364   MAGEB2  +
chrX    30261847    30270155    NM_002363   MAGEB1  +

I've tried adopting some other awk questions with similar desired output from the abundance of similar awk questions on SO:

awk -F '\t' 'NR==FNR{c[$1]++;next}c[$5]' file1 file2

But I only get one match for the AARS row, even though there are other matches in the test files:

chr1    210111518   210337633   NM_001146262    AARS    +

The output I am looking for is:

chr14   94463615    94473898    NR_024182   AAGAB   +
chr10   74033676    74035797    NM_019058   AAK1    +
chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +

I've tried adopting a couple other posts on stackoverflow but keep getting the same output of just 1 match even though multiple others exist:

awk to match file1 with file2 and output matches

awk -F '\t' 'NR==FNR{c[$1]; next} ($5 in c)' file1 file2

Find rows with the same value in a column in two files

awk -F '\t' 'NR==FNR{A[$1];next}$5 in A' file1 file2

Comparing two columns in two files using awk

awk -F '\t' 'FNR==NR {a[$1];next} {for (i in a) if ($5~i) print}' file1 file2

Can someone help me understand where my awk line is going wrong?

I'm currently using GNU Awk 4.2.1, API: 2.0

2
most likely your file has additional tabs or tab space combinations altering field counter. If your fields do not include space char, no need to specify the field delimiter. Please run the same code without -F'\t'karakfa
Running without -F '\t' produced the same result but your comment made me think that maybe there was something wrong with the files themselves. I used cat -v to check for any other tabs/space combinations and saw that at the end of file1 there was ^M at the end of every row in file1 except for AARS (since I had opened the file in excel to save as txt file). So I used dos2unix to remove the ^M awk works as expected with -F '\t' removed as you suggested. Thanks so much that was driving me nuts!cutts
Whee, I love it when the solution turns out to be something totally unmentioned in an otherwise good-looking question. :-) Next time, as a sanity check, try going through your question as someone answering it would -- copying your sample input from SO to temp files and running the code on it.ghoti

2 Answers

0
votes

I think you almost had it... I did check -- and the following works for me.

awk -F '\t' 'BEGIN { split("", a) } NR == FNR { a[$0] = 1; next } $5 in a' file1 file2

Result (the output in your example is not consistent with file2):

chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +

It is possible that there is something wrong with the FS, I suppose... You could try to make FS hit on both space and tab & also attempt to remove any invisible and non-printable characters from the mess?

awk -F '[[:blank:]]+' '
    BEGIN { split("", a) }
    NR == FNR {
        gsub(/[^[:graph:]]/, "")
        a[$0] = 1
        next
    }
    {
        p = $5
        gsub(/[^[:graph:]]/, "", p)
    }
    p in a' file1 file2
0
votes

Read from file1 into an associative array. Then read file2 and for each line looking for a match.

$ awk 'BEGIN{while(getline line<"file1") {x[line]="1";}} {for (i=1;i<=NF;i++) if(x[$i]) {print; break} }' file2
chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +