6
votes

I'm using Perl to generate a list of unique exons (which are the units of genes).

I've generated a file in this format (with hundreds of thousands of lines):

chr1 1000 2000 gene1

chr1 3000 4000 gene2

chr1 5000 6000 gene3

chr1 1000 2000 gene4

Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.

Because genes are often constructed of different arrangements of exons, you have the same exon in multiple genes (see the first and fourth sets). I want to remove these "duplicate" - ie, delete gene1 or gene4 (not important which one gets removed).

I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's important that I don't lose the gene name, also. Otherwise this would be simpler.

Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:

for (my $i = 0; $i < scalar @exons; $i++) {
my @temp_line = line_splitter($exons[$i]);                      # runs subroutine turning scalar into array
for (my $j = 0; $j < scalar @exons_dup; $j++) {
    my @inner_temp_line = line_splitter($exons_dup[$j]);        # runs subroutine turning scalar into array
    unless (($temp_line[1] == $inner_temp_line[1]) &&           # this loop ensures that the the loop
            ($temp_line[3] eq $inner_temp_line[3])) {           # below skips the identical lines
                if (($temp_line[1] == $inner_temp_line[1]) &&   # if the coordinates are the same
                    ($temp_line[2] == $inner_temp_line[2])) {   # between the comparisons
                        splice(@exons, $i, 1);                  # delete the first one
                    }
            }
}

}

4

4 Answers

7
votes
my @exons = (
    'chr1 1000 2000 gene1',
    'chr1 3000 4000 gene2',
    'chr1 5000 6000 gene3',
    'chr1 1000 2000 gene4'
);

my %unique_exons = map { 
    my ($chro, $scoor, $ecoor, $gene) = (split(/\s+/, $_));
    "$chro $scoor $ecoor" => $gene
} @exons;

print "$_ $unique_exons{$_} \n" for keys %unique_exons;

This will give you uniqueness, and the last gene name will be included. This results in:

chr1 1000 2000 gene4 
chr1 5000 6000 gene3 
chr1 3000 4000 gene2
3
votes

You can use a hash to dedup en passant, but you need a way to join the parts you want to use to detect duplicates into a single string.

sub extract_dup_check_string {
    my $exon = shift;
    my @parts = line_splitter($exon);
    # modify to suit:
    my $dup_check_string = join( ';', @parts[0..2] );
    return $dup_check_string;
}

my %seen;
@deduped_exons = grep !$seen{ extract_dup_check_string($_) }++, @exons;
1
votes

You can use a hash to keep track of duplicates you've already seen and then skip them. This example assumes the fields in your input file are space-delimited:

#!/usr/bin/env perl                                                                                                                                                                                                                   

use strict;
use warnings;

my %seen;
while (my $line = <>) {

  my($chromosome, $exon_start, $exon_end, $gene) = split /\s+/, $line;
  my $key = join ':', $chromosome, $exon_start, $exon_end;

  if ($seen{$key}) {
    next;
  }
  else {
    $seen{$key}++;
    print $line;
  }

}
0
votes

As simple as it comes. I tried to use as little magic as possible.

my %exoms = ();
my $input;

open( $input, '<', "lines.in" ) or die $!;

while( <$input> )
{
    if( $_ =~ /^(\w+\s+){3}(\w+)$/ ) #ignore lines that are not in expected format
    {
        my @splits = split( /\s+/, $_ ); #split line in $_ on multiple spaces

        my $key = $splits[1] . '_' . $splits[2];

        if( !exists( $exoms{$key} ) )
        {
            #could output or write to a new file here, probably output to a file
            #for large sets.
            $exoms{$key} = \@splits; 
        }
    }

}


#demo to show what was parsed from demo input
while( my ($key, $value) = each(%exoms) )
{
    my @splits = @{$value};
    foreach my $position (@splits)
    {
        print( "$position " );
    }
    print( "\n" );

}