Compare two columns of two files and count the differences

1

votes

I have two tab-separated files that I want to compare line by line and column by column: column 1 of file1 against column 1 of file2, and so forth for all n columns.

The goal is to count the differences in each column.

Values in columns can be either 0, 1 or 2, for example:

File1:

col1 col2 col3 col4
1 1 1 2
1 1 1 2
2 1 2 2
2 1 2 2

File2:
col1 col2 col3 col4
1 1 1 1
1 1 0 1
0 1 0 1
1 0 1 0

Results
2 1 3 4

So col1 of file1 and file2 has 2 differences, col2 of file1 and file2 has 1 difference, and so forth. I have seen many similar questions about AWK, but most of them compare columns and append a column from either file depending on whether the values match; they do not count differences.

I believe a check for mismatches between two columns would start with something like this, but from there I am totally lost...

awk 'NR==FNR { a[$1]!=$1; next}

Thanks

awk comparison multiple-columns

col1 of file1 and file2 with 2 differences: Are you sure about this? It is the same in both files, i.e. the values 1 1 2 2? – anubhava

@anubhava you are right. I fixed the example. – guidebortoli

3

votes

You may use this awk:

awk 'BEGIN{FS=OFS="\t"} FNR == NR {for (i=1; i<=NF; ++i) a[i,FNR] = $i; next} FNR > 1 {for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]; ncol=NF} END {print "Results"; for (i=1; i <= ncol; ++i) printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)}' f2 f1

Results
2   1   3   4

A more readable form:

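awk 'BEGIN {FS=OFS="\t"}
FNR == NR {                  # first file: store every field by column and line
   for (i=1; i<=NF; ++i)
      a[i,FNR] = $i
   next
}
FNR > 1 {                    # second file, header skipped
   for (i=1; i<=NF; ++i)
      if ($i != a[i,FNR])
         ++out[i]
   ncol = NF                 # remember the column count for END
}
END {
   print "Results"
   for (i=1; i <= ncol; ++i)
      printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)
}' f2 f1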
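
To try this on the question's data, the two files can be recreated with printf. This is only a sketch: the names f1 and f2 follow the command above, while the temporary directory and the count_diffs wrapper function are made up for illustration. It wraps the same logic as the one-liner, minus the "Results" header line.

```shell
#!/bin/sh
# Recreate the sample files from the question as tab-separated text.
# The temporary directory is illustrative; f1 and f2 follow the answer.
tmp=$(mktemp -d)
printf 'col1\tcol2\tcol3\tcol4\n1\t1\t1\t2\n1\t1\t1\t2\n2\t1\t2\t2\n2\t1\t2\t2\n' > "$tmp/f1"
printf 'col1\tcol2\tcol3\tcol4\n1\t1\t1\t1\n1\t1\t0\t1\n0\t1\t0\t1\n1\t0\t1\t0\n' > "$tmp/f2"

# count_diffs is a hypothetical wrapper: count_diffs FIRST_FILE SECOND_FILE
count_diffs() {
    awk 'BEGIN { FS = OFS = "\t" }
    FNR == NR {                       # first file: remember every field
        for (i = 1; i <= NF; ++i)
            a[i, FNR] = $i
        next
    }
    FNR > 1 {                         # second file, header skipped
        for (i = 1; i <= NF; ++i)
            if ($i != a[i, FNR])
                ++out[i]
        ncol = NF                     # column count needed in END
    }
    END {
        for (i = 1; i <= ncol; ++i)
            printf "%s%s", out[i] + 0, (i < ncol ? OFS : ORS)
    }' "$1" "$2"
}

count_diffs "$tmp/f2" "$tmp/f1"
```

The order of the two file arguments does not change the counts, since a mismatch is symmetric.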

3

votes

If you have paste available, you can do this without storing anything in an array except the output:

paste File1 File2 |
awk '
    NR > 1 {
        mid = NF/2
        for (i=1; i<=mid; i++) {
            count[i] += ( $i == $(mid+i) ? 0 : 1 )
        }
    }
    END {
        for (i=1; i<=mid; i++) {
            printf "%d%s", count[i], (i<mid ? OFS : ORS)
        }
    }
'

Output:

2 1 3 4

1

votes

With getline:

$ cat foo.awk
NR == 1 { n = NF; }
{
  if(NF != n) { print "error"; exit 1; }
  for(i = 1; i <= n; i++) a[i] = $i;
  if((getline < f) != 1 || NF != n) { print "error"; exit 1; }
  for(i = 1; i <= NF; i++) if(a[i] != $i) c[i] += 1;
}
END {
  for(i = 1; i <= n; i++) printf("%d%c", c[i], (i == n) ? "\n" : " ");
}

$ awk -v f=File1 -f foo.awk File2
2 1 3 4

Explanation:

Variable f holds the name of the first file. We pass it to awk with the -v f=File1 option, and we pass the second file name (File2) to awk as the file to process.
We set n (number of fields) from the first line of the second file. Later, if we encounter a line with a different number of fields in one of the two files we exit with an error message.
We fill array a with the fields from the current line.
Then we read the next line from the first file with getline, which replaces the current fields with the new values. We exit with an error message if getline fails.
We compare the fields with array a and increment elements of array c if a difference is found.
At the end we print array c.

Note: some awk experts advocate against getline. If you prefer avoiding it too, prefer the solutions that pass File1 and File2 to awk and store the content of the first one in an array. But if your files are large remember that you could encounter memory issues, while the getline-based solution could process billions of lines of hundreds of fields without any problem (but would you use awk in this case?).
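
As a side note, getline can also read into a variable (getline var < file), which leaves $0, the fields and NF of the main input untouched. Below is a rough sketch of that variant, not the script above: the function name count_diffs_getline and the variables line, b and err are made up for illustration, and it assumes whitespace-separated files. Like the original, it does not notice extra trailing lines in the file read with getline.

```shell
#!/bin/sh
# Sample data from the question, in an illustrative temporary directory.
tmp=$(mktemp -d)
printf 'col1 col2 col3 col4\n1 1 1 2\n1 1 1 2\n2 1 2 2\n2 1 2 2\n' > "$tmp/File1"
printf 'col1 col2 col3 col4\n1 1 1 1\n1 1 0 1\n0 1 0 1\n1 0 1 0\n' > "$tmp/File2"

# Hypothetical wrapper: count_diffs_getline GETLINE_FILE MAIN_FILE
count_diffs_getline() {
    awk -v f="$1" '
    NR == 1 { n = NF }                    # column count from the header
    {
        # Read the matching line of file f into the variable "line",
        # split it into b[1..n], and stop on any length mismatch.
        if ((getline line < f) != 1 || split(line, b) != n || NF != n) {
            err = 1
            exit
        }
        if (NR == 1) next                 # headers are not compared
        for (i = 1; i <= n; i++)
            if (b[i] != $i) c[i]++
    }
    END {
        if (err) { print "error: line or field count mismatch"; exit 1 }
        for (i = 1; i <= n; i++)
            printf "%d%s", c[i], (i == n ? "\n" : " ")
    }' "$2"
}

count_diffs_getline "$tmp/File1" "$tmp/File2"
```

Memory use stays constant here as well, because only the current line of each file is held at any time.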

1

votes

Since the values in the fields are single characters (0, 1, 2), we can skip the headers and pack each column's values into one string per field number, without delimiters (for example a[1]="1122"), then use substr() to extract the character to compare ($i!=substr(a[i],FNR-1,1)):

awk '
NR==FNR && NR>1 {                         # process first file, ignore header
    for(i=1;i<=NF;i++)                    # since column values are 1 digit only
        a[i]=a[i] $i                      # just concatenate them, no separators
    next
}
FNR>1 {                                   # process second file
    for(i=1;i<=NF;i++)
        r[i]+=($i!=substr(a[i],FNR-1,1))  # compare field data and count mismatches
}
END {                                     # in the end
    for(i=1;(i in r);i++)                 # loop and ...
        printf "%s%s",(i==1?"":OFS),r[i]  # output
    print ""
}' file1 file2

Output:

2 1 3 4

Note: This only works for single-character values, as requested in the OP.
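
To illustrate that limitation, here is a small sketch with made-up data in which one value has two characters. The file contents, the temporary directory and the packed_diffs wrapper name are assumptions for the demonstration only. The two files are identical, yet the packed-string approach reports differences because substr() no longer lines up with the rows.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Two identical single-column files; the value 10 is two characters long.
printf 'col1\n10\n2\n' > "$tmp/file1"
printf 'col1\n10\n2\n' > "$tmp/file2"

# Hypothetical wrapper around the packed-string program.
packed_diffs() {
    awk '
    NR==FNR && NR>1 {
        for(i=1;i<=NF;i++)
            a[i]=a[i] $i                      # a[1] ends up as "102"
        next
    }
    FNR>1 {
        for(i=1;i<=NF;i++)
            r[i]+=($i!=substr(a[i],FNR-1,1))  # "10" vs "1", then "2" vs "0"
    }
    END {
        for(i=1;(i in r);i++)
            printf "%s%s",(i==1?"":OFS),r[i]
        print ""
    }' "$1" "$2"
}

packed_diffs "$tmp/file1" "$tmp/file2"   # prints 2 although the files match
```

For values of varying width, storing each cell separately (for example keyed by column and line number) avoids the problem.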
