Perl: Removing duplicates from a large set of data

Question

I'm using Perl to generate a list of unique exons (which are the units of genes).

I've generated a file in this format (with hundreds of thousands of lines):

chr1 1000 2000 gene1

chr1 3000 4000 gene2

chr1 5000 6000 gene3

chr1 1000 2000 gene4

Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.

Because genes are often constructed of different arrangements of exons, the same exon can appear in multiple genes (see the first and fourth lines). I want to remove these "duplicates", i.e., delete either the gene1 line or the gene4 line (it doesn't matter which one gets removed).

I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's also important that I don't lose the gene name; otherwise this would be simpler.

Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:

for (my $i = 0; $i < scalar @exons; $i++) {
    my @temp_line = line_splitter($exons[$i]);                      # runs subroutine turning scalar into array
    for (my $j = 0; $j < scalar @exons_dup; $j++) {
        my @inner_temp_line = line_splitter($exons_dup[$j]);        # runs subroutine turning scalar into array
        unless (($temp_line[1] == $inner_temp_line[1]) &&           # this check ensures that the test
                ($temp_line[3] eq $inner_temp_line[3])) {           # below skips the identical lines
            if (($temp_line[1] == $inner_temp_line[1]) &&           # if the coordinates are the same
                ($temp_line[2] == $inner_temp_line[2])) {           # between the comparisons
                splice(@exons, $i, 1);                              # delete the first one
            }
        }
    }
}

dalton dalton · Accepted Answer · 2011-04-18T21:27:34

my @exons = (
    'chr1 1000 2000 gene1',
    'chr1 3000 4000 gene2',
    'chr1 5000 6000 gene3',
    'chr1 1000 2000 gene4'
);

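# Key each exon by "chromosome start end"; a later line with the same
# key overwrites the earlier one, so duplicates collapse to one entry.
my %unique_exons = map {
    my ($chro, $scoor, $ecoor, $gene) = split /\s+/, $_;
    "$chro $scoor $ecoor" => $gene
} @exons;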

print "$_ $unique_exons{$_} \n" for keys %unique_exons;

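This gives you one entry per unique exon, because the hash key is built from the chromosome and both coordinates. When an exon appears more than once, the gene name from the last occurrence is the one kept. Hash keys come back in no particular order, so the lines may print in a different order on your machine. One possible result: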

chr1 1000 2000 gene4 
chr1 5000 6000 gene3 
chr1 3000 4000 gene2
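
If you would rather keep the first gene name seen and preserve the original line order, a "seen" hash is one way to do it. This is only a minimal sketch, not part of the answer above: the helper name unique_exons is hypothetical, and it assumes every line is whitespace-separated with the chromosome, start, and end in the first three columns.

```perl
use strict;
use warnings;

# Hypothetical helper: return the input lines with duplicate exons removed.
# Two lines count as duplicates when chromosome, start, and end all match;
# the first line seen for each exon is the one that is kept.
sub unique_exons {
    my @lines = @_;
    my %seen;      # exon key => number of times encountered so far
    my @unique;    # surviving lines, in their original order

    for my $line (@lines) {
        my ($chr, $start, $end) = split ' ', $line;
        next unless defined $end;    # skip blank or malformed lines

        my $key = join "\t", $chr, $start, $end;
        push @unique, $line unless $seen{$key}++;
    }
    return @unique;
}

my @exons = (
    'chr1 1000 2000 gene1',
    'chr1 3000 4000 gene2',
    'chr1 5000 6000 gene3',
    'chr1 1000 2000 gene4',
);

print "$_\n" for unique_exons(@exons);
```

With the sample data this prints the gene1, gene2, and gene3 lines in that order and drops the gene4 line. For a file with hundreds of thousands of lines, the same check can be applied while reading the file line by line, so only the hash of keys has to stay in memory.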
