0
votes

I have two very large files on Unix, each containing, say, 5 columns but millions of lines.

Ex :

File1: abc|def|ghk|ijk|lmn .... ...

File2: abc|def|ghk|ijk|123 ... ...

My task is to compare the two large files and find the columns and rows that differ. For instance, the output would be: Column-no Row-no File1-word File2-word.

Ex :

5 1 lmn 123

The files are really large, and the output can't wait too long. I have heard that awk is the fastest way to parse files in Unix.

Can this be done using awk?

Yes, it can be done with awk. Reading from two files concurrently is hard, but saving all the input from one file and then using it while reading the second is a normal mode of operation for awk scripts. What did you try, and where did you run into problems? If you can use Perl or Python, you'd find it easier to avoid slurping the whole of one file into memory. - Jonathan Leffler
Even if I use Perl, I at least have to slurp one file into memory, right? And then use that data structure to compare against the second file. - Subhayan Bhattacharya
No; using Perl, you'd read one line from file 1 and one line from file 2, compare those lines, and print the differences; rinse and repeat. - Jonathan Leffler
@JonathanLeffler obviously you can do that in awk too, with a getline < file2 for every line read from file1. I'm not saying that's the best approach of course, just that it's doable. Subhayan - edit your question to include concise, testable sample input (e.g. a couple of files of 4 or 5 rows and 4 or 5 columns each) and expected output. - Ed Morton
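
The lockstep approach described in these comments can be sketched in awk with getline, so that neither file is held in memory. This is only a minimal sketch, assuming both files have the same number of lines and fields; the variable names (f2, other, b), the scratch directory, and the second sample row are illustrative and not from the thread. It prints in the order the question asked for: column, row, file1 word, file2 word.

```shell
# Scratch directory with two-row sample files (illustrative data).
dir=$(mktemp -d)
printf 'abc|def|ghk|ijk|lmn\nq|r|s|t|u\n' > "$dir/file1"
printf 'abc|def|ghk|ijk|123\nq|r|s|t|u\n' > "$dir/file2"

# For each line of file1, read the matching line of file2 with getline,
# split it on the same separator, and report the fields that differ.
lockstep=$(awk -F'|' -v f2="$dir/file2" '
  {
    if ((getline other < f2) <= 0) exit   # file2 ran out of lines (or is unreadable)
    split(other, b, FS)
    for (i = 1; i <= NF; i++)
      if ($i "" != b[i] "") printf "%d %d %s %s\n", i, NR, $i, b[i]
  }' "$dir/file1")
printf '%s\n' "$lockstep"
# 5 1 lmn 123
rm -r "$dir"
```

The `""` appended to each side forces a string comparison, so numeric-looking fields such as 1.0 and 1 are reported as different.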

1 Answer

3
votes

paste/awk solution

$ paste -d'|' file1 file2 |
  awk -F'|' '{w=NF/2;                # number of fields per file
              for(i=1;i<=w;i++)      # compare field i of file1 with field i of file2
                 if($i!=$(i+w)) printf "%d %d %s %s\n", NR,i,$i,$(i+w)}'

1 5 lmn 123

I changed the order: it makes more sense to me to print the line number first and then the field number, but you can change it easily.
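
For the column-first layout the question asked for (Column-no Row-no File1-word File2-word), swapping the first two printf arguments should be enough. A minimal sketch, assuming the one-row sample from the question; the scratch directory and the col_first variable are illustrative:

```shell
# Scratch directory with the one-row sample from the question.
dir=$(mktemp -d)
printf 'abc|def|ghk|ijk|lmn\n' > "$dir/file1"
printf 'abc|def|ghk|ijk|123\n' > "$dir/file2"

# Same comparison as above, but printing the field number before the line number.
col_first=$(paste -d'|' "$dir/file1" "$dir/file2" |
  awk -F'|' '{ w = NF/2
               for (i = 1; i <= w; i++)
                 if ($i != $(i+w)) printf "%d %d %s %s\n", i, NR, $i, $(i+w) }')
printf '%s\n' "$col_first"
# 5 1 lmn 123
rm -r "$dir"
```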

Once paste has joined the corresponding lines of the two files, awk goes over the fields in the first half of each record (from the first file), compares each with its counterpart in the second half (from the second file), and prints the differences. awk loops over all records (lines) implicitly. I haven't tested this with large files, but for the awk part size doesn't matter, since it works record by record. I'm not sure how eager paste is, but I doubt it will blink.
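
To make the mechanics concrete, here is a small sketch on made-up three-line files (the data, the scratch directory, and the joined/diffs variables are illustrative, not from the question). It first shows the joined records that awk sees, then the differences. One deliberate change from the command above: `""` is appended to each side of the comparison to force a string comparison, because awk compares two numeric-looking fields as numbers, so 1.0 and 1 would otherwise count as equal.

```shell
# Scratch directory with three-line, three-column sample files.
dir=$(mktemp -d)
printf 'a|b|c\nd|e|f\ng|1.0|i\n' > "$dir/file1"
printf 'a|b|c\nd|X|f\ng|1|Z\n'   > "$dir/file2"

# Step 1: paste joins line N of file1 and line N of file2 into one record.
joined=$(paste -d'|' "$dir/file1" "$dir/file2")
printf '%s\n' "$joined"
# a|b|c|a|b|c
# d|e|f|d|X|f
# g|1.0|i|g|1|Z

# Step 2: awk compares field i (file1 half) with field i+w (file2 half).
diffs=$(printf '%s\n' "$joined" |
  awk -F'|' '{ w = NF/2
               for (i = 1; i <= w; i++)
                 if ($i "" != $(i+w) "") printf "%d %d %s %s\n", NR, i, $i, $(i+w) }')
printf '%s\n' "$diffs"
# 2 2 e X
# 3 2 1.0 1
# 3 3 i Z
rm -r "$dir"
```

Because both paste and awk process one line at a time, memory use should stay flat regardless of file size, though that has not been measured here.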