1
votes

File 1:

col1    col2     col3   col4   col5    col6     col7    col8       

chr1    1361651 1361652 1       3       0       0       1
chr1    1421915 1421916 1       1       1       0       0
chr1    3329147 3329148 2       2,3     0       1       1
chr1    8421092 8421093 3       1,2,3   1       1       1
chr1    13802362        13802363        3       1,2,3   1       1       1
chr1    43315088        43315089        2       1,2     1       1       0
chr1    52256664        52256665        2       1,3     1       0       1

File 2 :

col1      col2       col3     col4     col5    col6     col7.....col16    

chr1    1361651     1361652    G       data5   data6    data7....data16
chr1    2468066     2468067   G       data5   data6    data7....data16
chr1    3329147     3329148   ........
chr1    8421092     8421093   ........ 
chr1    13802362    13802363   ........        
chr1    43315088    43315089   ........        
chr1    52256664    52256665   ........

Output.txt

Check whether column 5 of file 1 is 1,2,3. If it is, compare column 1 and column 2 between file 1 and file 2 and print the matching rows to a separate file.

col1      col2       col3     col4     col5    col6     col7.....col16

chr1    8421092     8421093   ........ 
chr1    13802362    13802363   ........        

My code compares the two files, but I first need to check the condition within file 1 and only then compare across the files.

my $file1 = $ARGV[0];
my $file2 = $ARGV[1];
open(FILE1, $file1);
open(FILE2, $file2);
open my $f, '>', "output.txt" or die "Cannot open output.txt: $!";
my @arr1=<FILE1>;
my @arr2=<FILE2>;
close FILE1;
close FILE2;
for (@arr1)
{
    chomp;
    my($hit1,$hit2,$hit3,$hit4,$hit5,$rest)=split(/\t/);
    my $ckey="$hit1\_$hit2";
    $chash{$ckey}=1;
}
for (@arr2)
{
    chomp;
    my($val1,$val2,$val3,$val4,$val5,$rest)=split(/\t/);
    my $ckey="$val1\_$val2";
    $chash{$ckey}++;
    if( $chash{$ckey} == 2 )
    {
    # this key has been seen in both previous files
    print $f "$_\n";
    }
}
Questions: 1) Finding '1,2,3' is a condition, and if it is not found we don't bother; is this correct? 2) Column 1 is the same in both files in the example ... can they differ? 3) Once a match is found, are the whole rows (of the match) guaranteed to be equal? 4) Does the order of the lines written out matter? - zdim
@zdim, Questions: 1) correct 2) they are different 3) if a match is found, print the file 2 row in full 4) order doesn't matter - user2767090
Can there be (legitimate) empty fields in your files? If the answer is 'no' you can use the much simpler version in my answer. If the answer is 'no, never' I should probably remove the part with checks etc. If the data may be messier (various fields missing) one should have more checks. Please let me know. - zdim

2 Answers

2
votes

The code shown is a bit too complicated. It is also not clear how the hash will behave if the same key happens to appear more than once across the files. Further, one needs to keep the whole lines and tie them to the places where matches occur, which would require extra data structures. Here is a simpler approach.

Join the first two fields on each line and push that string onto an array. While passing through file1, also check for the condition; if it is never found, exit. Form and store the same strings for file2, also storing the whole lines. Then iterate over the indices of either array and, where the strings match, select the corresponding line of file2 (per the requirement). These lines are our output. The code can be simpler, see Notes.

use warnings;
use strict;

my $patt = '1,2,3';

# Join cols 1,2 into a string, store; check condition
open my $fh1, '<', 'file1.txt' or die "Cannot open file1.txt: $!";
my @f1;
my $go = 0;
while (my $line = <$fh1>) {
    next if $line =~ /^\s*$/;
    my @cols = split ' ', $line;   # ' ' also skips leading whitespace
    my ($c1, $c2) = @cols[0,1];
    next if not $c1 or not $c2;
    push @f1, join '_', $c1, $c2;
    $go = 1 if $cols[4] and $patt eq $cols[4];
}
close $fh1;

if (not $go) {
    print "Condition not satisfied, exiting.\n";    
    exit 0;
}

# Join cols 1, 2 from file2, store; store lines
my (@f2, @lines);
open my $fh2, '<', 'file2.txt' or die "Cannot open file2.txt: $!";
while (<$fh2>) {
    next if /^\s*$/;
    my ($c1, $c2) = (split)[0,1];
    next if not $c1 or not $c2;
    push @f2, join('_', $c1, $c2);
    push @lines, $_;
}
close $fh2;

# Find matches: compare strings from arrays
# Print corresponding lines file2
my @output;
foreach my $i (0..$#f2) {
    push(@output, $lines[$i]) if $f1[$i] eq $f2[$i];
} 
print "$_\n" for @output;

Note. By the problem description most lines of two sample files match, having equal first two fields. The shown expected output disagrees with this but the description is fairly explicit.

With extra empty lines removed by hand for space, this prints

col1       col2        col3   col4      col5    col6     col7    col8       
chr1    1361651     1361652    G       data5   data6    data7....data16
chr1    3329147     3329148   ........
chr1    8421092     8421093   ........ 
chr1    13802362    13802363   ........        
chr1    43315088    43315089   ........        
chr1    52256664    52256665   ........

Notes. For mere comparison the fields can simply be joined; having a recognizable separator (just _ here) allows us to restore them if needed. Some reasonable assumptions are clearly made: the files are of the same length, with the same structure (the same columns missing). If these don't hold it is easy to adjust this step-by-step processing. While reading the files we guard against either of the first two fields missing and against a missing fifth column. If these checks are surely not needed, the code reduces to

while (<$fh1>) {
    next if /^\s*$/;
    my ($c1, $c2, $c5) = (split)[0,1,4];   # index 4 is column 5
    push @f1, join '_', $c1, $c2;
    $go = 1 if $patt eq $c5;
}
exit if not $go;
while (<$fh2>) {
    next if /^\s*$/;
    push @f2, join '_', (split)[0,1];
    push @lines, $_;
}
@output = map { $lines[$_] } grep { $f1[$_] eq $f2[$_] } (0..$#f2);
1
votes

Your description is a little ambiguous. Setting aside the check for "1,2,3" for a moment, it talks of comparing columns 1 and 2, but column 1 has the same thing on every line in both files, i.e. "chr1". As you've highlighted the numbers in columns 2 & 3, and as they appear in the "Output.txt" file, I presume you mean those two columns and not 1 and 2. That's the basis I'm proceeding on.

Before moving on to the solution, I just want to highlight a couple of problems with your existing code. Firstly, you are string-concatenating the two columns. What if columns 2 & 3 have "46" & "123" respectively in one file, and in the other it's "461" & "23"? Without a separator between them, your concatenation is going to give you a false match. Now maybe that just "ain't going to happen", and if you know your data that well, then fair enough, but you need to be aware of the possibility.
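To make the concatenation point concrete, here is a minimal sketch. The helper names naive_key and safe_key are made up for illustration and do not appear in the question. Note that the key in the question, "$hit1\_$hit2", does put an underscore between the fields, so the collision only arises if that separator is dropped or can itself occur inside a field.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
# naive_key glues the fields together with nothing in between.
sub naive_key { my @fields = @_; return join '', @fields; }

# safe_key puts a separator between the fields; this assumes the
# separator never occurs inside a field.
sub safe_key  { my @fields = @_; return join '_', @fields; }

my @row_a = ('46',  '123');
my @row_b = ('461', '23');

# Both rows collapse to "46123" without a separator.
print "naive keys collide\n"
    if naive_key(@row_a) eq naive_key(@row_b);

# With a separator the keys are "46_123" and "461_23".
print "separated keys differ\n"
    if safe_key(@row_a) ne safe_key(@row_b);
```

For numeric columns like these an underscore is probably safe; for free text a character that cannot appear in the data (a tab, say, when splitting on whitespace) may be a better choice.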

More importantly, the hash keeping track of the numbers previously seen is insufficient for the task you need of it. What happens if there are two lines with the same content in columns 2 & 3 in the same file? What happens if there are two lines the same in one file and one line the same in the other file, giving a total of 3 when you're only looking for a tally of 2? Again, you may know that these combinations are not going to show up in your data, but you need to be aware of the lurking bug.
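A small sketch of the tally problem, under the assumption that keys are already built as strings. Both sub names are hypothetical. matches_by_tally mimics the counting logic in the question; matches_by_membership only asks whether a key was seen in file 1, which is one possible way around the issue.

```perl
use strict;
use warnings;

# Mimics the question's logic: set 1 for file-1 keys, increment
# for file-2 keys, report a key when the count reaches exactly 2.
sub matches_by_tally {
    my ($keys1, $keys2) = @_;
    my %count;
    $count{$_} = 1 for @$keys1;
    my @hits;
    for my $key (@$keys2) {
        $count{$key}++;
        push @hits, $key if $count{$key} == 2;
    }
    return @hits;
}

# Alternative: remember only what file 1 contained, then test
# each file-2 key for membership. No counting involved.
sub matches_by_membership {
    my ($keys1, $keys2) = @_;
    my %in_file1 = map { $_ => 1 } @$keys1;
    return grep { $in_file1{$_} } @$keys2;
}

my @file1_keys = ('chr1_100');
my @file2_keys = ('chr1_100', 'chr1_100', 'chr1_200', 'chr1_200');

print "tally:      ", join(',', matches_by_tally(\@file1_keys, \@file2_keys)), "\n";
print "membership: ", join(',', matches_by_membership(\@file1_keys, \@file2_keys)), "\n";
```

With this input the tally version reports chr1_200, which never occurs in file 1 (it is merely repeated in file 2), and it reports chr1_100 only once. The membership version reports both file-2 rows for chr1_100 and nothing else.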

One other thing - it's not clear (to me, at least) if the match of columns 2 & 3 have to be on the same line of each file respectively. In your test data, columns 2 & 3 on lines 4 & 5 are matching lines 4 & 5 respectively in the other file - is that necessary? Or, (again, setting aside the "1,2,3" thing for a minute) can columns 2 & 3 on line 4 of the first file happily match columns 2 & 3 of line 7 in the second?

I don't mean to be difficult here but obviously these things are very relevant to finding the right solution.

If you want the minimal change to your existing code because none of the things I'm pointing out are going to matter, all you need to do is "bail out" of the first loop unless "1,2,3" is in column 5, that is, after the split, $hit5. So just add exactly that:

chomp;
my($hit1,$hit2,$hit3,$hit4,$hit5,$rest)=split(/\t/);
next unless $hit5 eq "1,2,3";   # <-- Added line: column 5 must be 1,2,3
my $ckey="$hit1\_$hit2";        # same key format as in the second loop
$chash{$ckey}=1; 

'next' immediately ends the current iteration and moves on to the next line, so %chash will not get updated for a row that lacks "1,2,3". But, I have to repeat, the end result is pretty precarious code.

Here is an alternative implementation:

#!/usr/bin/env perl
use v5.12;

my $file1 = $ARGV[0];
my $file2 = $ARGV[1];
open(FILE1, $file1) or die "$file1: $!\n";
open(FILE2, $file2) or die "$file2: $!\n";
open my $f, '>', "output.txt" or die "Cannot open output.txt: $!";

my @arr1 = map [split(" ", $_)], <FILE1>;
my @arr2 = map [split(" ", $_)], <FILE2>;
close FILE1;
close FILE2;

my $i = 0;
for my $arr1row (@arr1) {
    # Grab the same row in file 2
    my $arr2row = $arr2[$i++] ;

    # bail unless we have "1,2,3" in col 5
    next unless $arr1row->[4] eq "1,2,3" ;

    # bail if we dont have a line from file 2 because its shorter
    next unless defined $arr2row ;

    # If col2 and col3 are the same from each file ...
    if ($arr1row->[1] == $arr2row->[1] &&
        $arr1row->[2] == $arr2row->[2] )  {

        # print out all fields from file 2
        say $f join("\t", @$arr2row);
    }
}
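If the match does not have to be on the same line number in both files (the asker's comment says the order does not matter), a hash lookup is one option. The following is only a sketch: select_matches is a hypothetical helper, it takes the lines as array references rather than reading files, it keys on columns 1 and 2 as the question literally states, and it assumes whitespace-separated fields.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the file-2 lines whose columns 1 and 2
# match a file-1 row that has "1,2,3" in column 5.
sub select_matches {
    my ($lines1, $lines2) = @_;

    # Pass 1: remember the keys of qualifying file-1 rows.
    my %wanted;
    for my $line (@$lines1) {
        my @cols = split ' ', $line;
        next unless @cols >= 5 && $cols[4] eq '1,2,3';
        $wanted{ join "\t", @cols[0, 1] } = 1;
    }

    # Pass 2: keep file-2 lines whose key was remembered.
    my @out;
    for my $line (@$lines2) {
        my @cols = split ' ', $line;
        next unless @cols >= 2;
        push @out, $line if $wanted{ join "\t", @cols[0, 1] };
    }
    return @out;
}

# A few rows from the question's sample data.
my @file1 = (
    "chr1 1361651 1361652 1 3 0 0 1",
    "chr1 8421092 8421093 3 1,2,3 1 1 1",
    "chr1 13802362 13802363 3 1,2,3 1 1 1",
    "chr1 43315088 43315089 2 1,2 1 1 0",
);
my @file2 = (
    "chr1 1361651 1361652 G data5",
    "chr1 8421092 8421093 ........",
    "chr1 13802362 13802363 ........",
    "chr1 43315088 43315089 ........",
);

print "$_\n" for select_matches(\@file1, \@file2);
```

On these sample rows it selects the file-2 lines for 8421092 and 13802362, which agrees with the expected output shown in the question. The key is joined with a tab, which cannot occur inside a field once the line has been split on whitespace.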