compare within the file and then between the files

Question

File 1:

col1    col2     col3   col4   col5    col6     col7    col8       

chr1    1361651 1361652 1       3       0       0       1
chr1    1421915 1421916 1       1       1       0       0
chr1    3329147 3329148 2       2,3     0       1       1
chr1    8421092 8421093 3       1,2,3   1       1       1
chr1    13802362        13802363        3       1,2,3   1       1       1
chr1    43315088        43315089        2       1,2     1       1       0
chr1    52256664        52256665        2       1,3     1       0       1

File 2 :

col1      col2       col3     col4     col5    col6     col7.....col16    

chr1    1361651     1361652    G       data5   data6    data7....data16
chr1    2468066     2468067   G       data5   data6    data7....data16
chr1    3329147     3329148   ........
chr1    8421092     8421093   ........ 
chr1    13802362    13802363   ........        
chr1    43315088    43315089   ........        
chr1    52256664    52256665   ........

Output.txt

Check whether column 5 of file 1 is 1,2,3; if so, compare column 1 and column 2 between file 1 and file 2, and print the matching rows to a separate file.

col1      col2       col3     col4     col5    col6     col7.....col16

chr1    8421092     8421093   ........ 
chr1    13802362    13802363   ........

My code compares the two files, but I first need to check within file 1 and then compare across the files.

my $file1 = $ARGV[0];
my $file2 = $ARGV[1];
open(FILE1, $file1);
open(FILE2, $file2);
open my $f, '>', "output.txt" or die "Cannot open output.txt: $!";
my @arr1=<FILE1>;
my @arr2=<FILE2>;
close FILE1;
close FILE2;
for (@arr1)
{
    chomp;
    my($hit1,$hit2,$hit3,$hit4,$hit5,$rest)=split(/\t/);
    my $ckey="$hit1\_$hit2";
    $chash{$ckey}=1;
}
for (@arr2)
{
    chomp;
    my($val1,$val2,$val3,$val4,$val5,$rest)=split(/\t/);
    my $ckey="$val1\_$val2";
    $chash{$ckey}++;
    if( $chash{$ckey} == 2 )
    {
    # this key has been seen in both previous files
    print $f "$_\n";
    }
}

Questions: 1) finding '1,2,3' is a condition -- if not found don't bother, is this correct? 2) column 1 is the same in both files in the example ... can it differ? 3) once the match is found, are the whole rows (of the match) guaranteed to be equal? 4) does the order of lines to write out matter? — zdim
@zdim, Questions: 1) correct 2) they are different 3) if the match is found, print the file 2 row fully 4) order doesn't matter — user2767090
Can there be (legitimate) empty fields in your files? If the answer is 'no' you can use the much simpler version in my answer. If the answer is 'no, never' I should probably remove the part with checks etc. If the data may be messier (various fields missing) one should have more checks. Please let me know. — zdim

zdim · Accepted Answer · 2016-03-02T08:42:24

The code shown is more complicated than it needs to be. It is also not clear how the hash counting behaves if keys happen to repeat within or between the files. Further, one needs to keep the whole lines and tie them to the positions where matches occur, which would require extra data structures. Here is a simpler approach.

Join the first two fields of each line into one string and store it in an array. While reading file1, also check for the condition; if it is never found, exit. Build and store the same strings for file2, along with the whole lines. Then iterate over the indices of either array and, where the strings match, select the corresponding line of file2 (per the requirement). These lines are the output. The code can be simpler; see Notes.

use warnings;
use strict;

my $patt = '1,2,3';

# Join cols 1,2 into a string, store; check condition
open my $fh1, '<', 'file1.txt' or die "Cannot open file1.txt: $!";
my @f1;
my $go = 0;
while (my $line = <$fh1>) {
    next if $line =~ /^\s*$/;
    my @cols = split '\s+', $line;
    my ($c1, $c2) = @cols[0,1];
    next if not $c1 or not $c2;
    push @f1, join '_', $c1, $c2;
    $go = 1 if $cols[4] and $patt eq $cols[4];
}
close $fh1;

if (not $go) {
    print "Condition not satisfied, exiting.\n";    
    exit 0;
}

# Join cols 1, 2 from file2, store; store lines
my (@f2, @lines);
open my $fh2, '<', 'file2.txt' or die "Cannot open file2.txt: $!";
while (<$fh2>) {
    next if /^\s*$/;
    my ($c1, $c2) = (split)[0,1];
    next if not $c1 or not $c2;
    push @f2, join('_', $c1, $c2);
    push @lines, $_;
}
close $fh2;

# Find matches: compare strings from arrays
# Print corresponding lines file2
my @output;
foreach my $i (0..$#f2) {
    # The defined check avoids warnings if file2 has more lines than file1
    push(@output, $lines[$i]) if defined $f1[$i] and $f1[$i] eq $f2[$i];
} 
print "$_\n" for @output;

Note. By the problem description, most lines of the two sample files match, since they have equal first two fields. The expected output shown disagrees with this, but the description is fairly explicit.

With extra empty lines removed by hand for space, this prints

col1       col2        col3   col4      col5    col6     col7    col8       
chr1    1361651     1361652    G       data5   data6    data7....data16
chr1    3329147     3329148   ........
chr1    8421092     8421093   ........ 
chr1    13802362    13802363   ........        
chr1    43315088    43315089   ........        
chr1    52256664    52256665   ........

Notes. For mere comparison the fields can simply be joined; having a recognizable separator (just _ here) allows us to restore them if needed. Some reasonable assumptions are clearly made: the files are of the same length and have the same structure (same columns missing). If these don't hold, it is easy to adjust this step-by-step processing. While reading the files we guard against either of the first two fields missing and against a missing fifth column. If these checks are surely not needed, the reading and matching simplify to

while (<$fh1>) {
    next if /^\s*$/;
    my ($c1, $c2, $c5) = (split)[0,1,4];   # index 4 is the fifth column
    push @f1, join '_', $c1, $c2;
    $go = 1 if $patt eq $c5;
}
exit if not $go;
while (<$fh2>) {
    next if /^\s*$/;
    push @f2, join '_', (split)[0,1];
    push @lines, $_;
}
@output = map { $lines[$_] } grep { $f1[$_] eq $f2[$_] } (0..$#f2);
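
The expected output in the question suggests a row-by-row reading instead: keep only the file1 rows whose fifth column is 1,2,3, then print the file2 rows with the same first two fields. The code above does not do this. Below is a minimal sketch under that assumption. The sub name matching_rows is made up for illustration, and the columns are assumed to be whitespace-separated.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): returns the lines of
# file2 whose first two columns match a file1 row that has $patt in column 5.
sub matching_rows {
    my ($file1_lines, $file2_lines, $patt) = @_;

    # Pass 1: store "col1_col2" keys of file1 rows that satisfy the condition
    my %wanted;
    for my $line (@$file1_lines) {
        my @cols = split ' ', $line;
        next if @cols < 5;              # skip blank or short lines
        next if $cols[4] ne $patt;      # fifth column must equal the pattern
        $wanted{ join '_', @cols[0,1] } = 1;
    }

    # Pass 2: keep the file2 lines whose key was stored in pass 1
    my @out;
    for my $line (@$file2_lines) {
        my @cols = split ' ', $line;
        next if @cols < 2;
        push @out, $line if $wanted{ join '_', @cols[0,1] };
    }
    return @out;
}

# Usage sketch: perl script.pl file1.txt file2.txt > output.txt
if (@ARGV == 2) {
    open my $fh1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $fh2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    print matching_rows([<$fh1>], [<$fh2>], '1,2,3');
}
```

Because the lookup goes through a hash rather than by array index, the two files need not be aligned line by line. Matching rows come out in file2 order.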
