1
votes

I have a data set of 314 files (named file1, file2, file3, ...). Each file has two columns and a different number of rows.

Example Input file1

a 19
b 9
c 8
i 7
g 6
d 5

Example Input file2

a 19
i 7
g 6
d 5

I have another file (data.txt) with 314 rows, and each row has a different number of columns:

a d c g
a i
a d
d c

I want to compare column 1 of file1 with the 1st row of data.txt, column 1 of file2 with the 2nd row of data.txt, and so on, up to column 1 of file314 with the 314th row of data.txt.

My expected output is the number of genes that matched and mismatched for each file and its corresponding row.

I can only do this one file at a time. How can I do it in a single command?

Expected output

                        Matched    Mismatched
Ist_file_1st row        4          2
2nd_file_2nd row        2          2
.
.
314_file_314th row      -          -
1
Do you mean that all columns of each row should be matched against each row's columns, or should only the 1st column be matched? - RavinderSingh13
Only the 1st column. - Ravi Saroch
One more clarification: should Ist_column_1st row be Ist_file_1st row? Am I right here? - RavinderSingh13
Yes, I corrected this. - Ravi Saroch

1 Answer

3
votes

The easiest way is the following:

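awk '# Pass 1 (data.txt): store row N as an OFS-delimited string in a[N]
     (FNR==NR){$1=$1; a[FNR]=OFS $0 OFS; next}
     # A new file starts: report the counts of the previous file
     f && (FNR==1) { print f,m,nr-m }
     # Move on to the next row of data.txt and reset the counters
     (FNR==1){f++; nr=m=0}
     # Count the line; it is a match if column 1 occurs in row f
     {nr++; if(a[f] ~ OFS $1 OFS) m++ }
     # Report the counts of the last file
     END { print f,m,nr-m }' data.txt f1.txt f2.txt ... f314.txt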

With the data.txt from the question and its two example files saved as f1.txt and f2.txt, this produces:

1 4 2
2 2 2

The first column is the file number (which is also the row number in data.txt), the second column is the number of matches, and the third is the number of mismatches.
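
One thing to watch with 314 files: the program pairs the N-th file on the command line with the N-th row of data.txt, so the files have to be passed in numeric order. A glob such as f*.txt expands lexicographically (f1.txt, f10.txt, f100.txt, ...), which would pair files with the wrong rows. Below is a minimal sketch that builds the list with a counter instead. It assumes the files are named f1.txt, f2.txt, ... as in the command above, and it recreates the sample data from the question in a scratch directory, with n=2 standing in for 314.

```shell
#!/bin/sh
# Sketch: recreate the sample inputs from the question in a scratch directory.
# The names f1.txt, f2.txt follow the command above; n=2 is an assumption
# that fits the sample (it would be 314 for the full data set).
tmp=$(mktemp -d) && cd "$tmp" || exit 1

printf '%s\n' 'a 19' 'b 9' 'c 8' 'i 7' 'g 6' 'd 5' > f1.txt
printf '%s\n' 'a 19' 'i 7' 'g 6' 'd 5'             > f2.txt
printf '%s\n' 'a d c g' 'a i' 'a d' 'd c'          > data.txt

n=2

# Build the argument list in numeric order: data.txt f1.txt f2.txt ... fN.txt
set -- data.txt
i=1
while [ "$i" -le "$n" ]; do
    set -- "$@" "f$i.txt"
    i=$((i + 1))
done

# Same awk program as above, fed the generated list
result=$(awk '(FNR==NR){$1=$1; a[FNR]=OFS $0 OFS; next}
     f && (FNR==1) { print f,m,nr-m }
     (FNR==1){f++; nr=m=0}
     {nr++; if(a[f] ~ OFS $1 OFS) m++ }
     END { print f,m,nr-m }' "$@")

printf '%s\n' "$result"
```

On the sample data this should print the same two lines as shown above. Note that the program assumes no input file is empty: an empty file never triggers the FNR==1 rule, so every later file would be paired with the wrong row.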