Comparing two files using awk and printing contains which are matching from other files

Question

I have two files:

file1.txt

919167,hutch,mumbai
919594,idea,mumbai

file2.txt

919167000000
919594000000

Output

919167000000,hutch,mumbai
919594000000,idea,mumbai

How can I achieve this using AWK? I've got a huge file of phone numbers which needs to be compared like this. I believe Awk can handle it; if not please let me know how can I do this.

Extra definitions

Is the common part always a 6-digit number? Yes always 6.
Are the two files already sorted? file1 is not sorted. file2 can be sorted.
Are the trailing digits in file 2 always zeros? No, these are phone numbers this can vary, purpose of this is to get series information of the phone number.
Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? It's one-to-one. Can there be records in file 1 with no match in file 2, or vice versa?_ Yes.
If so, do you want to see the unmatched records? Yes I want both records.

Extended data

file1.txt

919167,hutch,mumbai
919594,idea,mumbai
918888,airtel,karnataka

file2.txt

919167838888
919594998484
919212334323

Output Expected:

919167838888,hutch,mumbai
919594998484,idea,mumbai
919212334323,nomatch,nomatch

Is the common part always a 6-digit number? Are the two files already sorted? The data shown is sorted. Are the trailing digits in file 2 always zeros? Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? Can there be records in file 1 with no match in file 2, or vice versa? If so, do you want to see the unmatched records? — Jonathan Leffler

Jonathan Leffler Jonathan Leffler · Accepted Answer · 2012-10-04T05:50:42

As I noted in a comment, there's a lot of unstated information needed to give a definitive answer. However, we can make some plausible guesses:

The common number is the first 6 digits of file 2 (we don't care about the trailing digits, but will simply copy them to the output).
The files are sorted in order.
If there are unmatched records in either file, those records will be ignored.

The tools of choice are probably sed and join:

sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt |
join -t, -o 1.2,2.2,2.3 - file1.txt

This edits file2.txt to create a comma-separated first field with the 6-digit phone number followed by all the rest of the line. The input is fed to the join command, which joins on the first column, and outputs the 'rest of the line' (column 2) from file2.txt and columns 2 and 3 from file1.txt.

If the phone numbers are variable length, then the matching operation is horribly complex. For that, I'd drop into Perl (or Python) to do the work. If the data is unsorted, it can be sorted before being fed into the commands. If you want unmatched records, you can specify how to handle those in the options to join.

The extra information needed is now available. The key information is the 6-digits is fixed — phew! Since you're on Linux, I'm assuming bash is available with 'process substitution':

sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -a 2 -e 'no-match' - <(sort file1.txt)

If process substitution is not available, simply sort file1.txt in situ:

sort -o file1.txt file1.txt

Then use file1.txt in place of <(sort file1.txt).

I think the comment might be asking for inputs such as:

file1.txt

919167,hutch,mumbai
919594,idea,mumbai
902130,airtel,karnataka

file2.txt

919167000000
919594000000
919342313242

Output

no-match,airtel,karnataka
919167000000,hutch,mumbai
919342313242,no-match,no-match
919594000000,idea,mumbai

If that's not what the comment is about, please clarify by editing the question to add the extra data and output in a more readable format than comments allow.

Working with the extended data, this mildly modified command:

sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -e 'no-match' - <(sort file1.txt)

produces the output:

919167838888,hutch,mumbai
919212334323,no-match,no-match
919594998484,idea,mumbai

which looks rather like a sorted version of the desired output. The -a n options control whether the unmatched records from file 1 or file 2 (or both) are printed; the -e option controls the value printed for the unmatched fields. All of this is readily available from the man pages for join, of course.

Comparing two files using awk and printing contains which are matching from other files

Extra definitions

Extended data

3 Answers

file1.txt

file2.txt

Output