1
votes

Good afternoon. I am writing some keys and values into a %hash, but I keep getting an undef value that I can't seem to explain.

my @maxent_unchanged = <FILE1>; 
close FILE1;
chomp (@maxent_unchanged);

my @NM;
my @max_score_unchanged;
foreach my $line(@maxent_unchanged) {

  if ($line =~ m/[a-z]/i) {
    push (@NM, $line);
  }
  else { 
    push (@max_score_unchanged, $line);
  }
}

my %max_unchanged;
my $i = 0;
foreach my $lines(@maxent_unchanged) {
  $max_unchanged{$NM[$i]} = $max_score_unchanged[$i]; ##maxent score for unchanged seq
  $i++;
}

To put this into context, @maxent_unchanged alternates between the lines that go into @NM and the lines that go into @max_score_unchanged, like this:

$VAR1 = 'TTAAGGCAGCCCACCCGCAGGCT        >       1       110740688       110740688       C       T       GCCTGGGCGGGGAGGGCTGTCACAGTGCCGGCAGCAGCCCTTAAGGCAGC[C]CACCCGCAGGCTGCCGAGCGCTACCTGTATTTCCCCAACTGGGCCATGGC splicing  splicing        SLC6A17:NM_001010898:exon12:c.1816-10C>T';
$VAR2 = '0.77';
$VAR3 = 'TTCTATCCTTTGTTTTACAGGAA        >       1       111857154       111857154       T       C       TTAAATGGAGGGAGTCCTGACTTTTGAAGTTTATCTGTTTCTATCCTTTG[T]TTTACAGGAACAGCCAGCTGAAAACTCTCCTGGCCATTGGAGGCTGGAAC splicing  splicing        CHIA:NM_201653:exon5:c.258-8T>C';
$VAR4 = '10.99';

Therefore it (@maxent_unchanged) has twice the number of lines of @NM and @max_score_unchanged. I have checked this and it holds true.

If I dump @NM and @max_score_unchanged, they contain the same number of elements, but when I put them into a hash I get an extra key-value pair, as the dump of the hash shows:

$VAR1 = '';
$VAR2 = undef;
$VAR3 = 'TTTTATTAATTCCTTTGTAGAAC        >       6       144835040       144835040       T       C       TATCATCTTAAATATTTCATATGGTTATGTAAGCATTTTATTAATTCCTT[T]GTAGAACCATCAGAACCAGCTAGAAATATTTGATGGGAACGTGGCTCACA splicing  splicing        UTRN:NM_007124:exon35:c.4945-5T>C';
$VAR4 = '8.22';
$VAR5 = 'TCTTTTTTGGACATGTACAGAGC        >       10      97127462        97127462        C       A       AGGAGTCTCTGAAGAAATTTCCGGAGTAGGGCTGATGGCTGAGCTCTGTA[C]ATGTCCAAAAAAGAAAAAAAAGAAGAAAAAAATAATGTAGATGATTTATT splicing  splicing        SORBS1:NM_001034957:exon13:c.1024-6G>T,NM_001034955:exon21:c.1972-6G>T,NM_001034956:exon18:c.1459-6G>T,NM_006434:exon13:c.1024-6G>T,NM_015385:exon17:c.1420-6G>T,NM_001034954:exon21:c.1906-6G>T,NM_024991:exon17:c.1147-6G>T';
$VAR6 = '4.43';

My keys are unique, so I know that is not the issue. Any ideas why?

Second, how can I remove the empty hash key and its value?

Many thanks for your patience and help in advance, E

3 Answers

2
votes

In the loop that fills the hash, you are iterating over @maxent_unchanged, but you should be iterating over @max_score_unchanged:

foreach my $lines(@max_score_unchanged) {
  $max_unchanged{$NM[$i]} = $max_score_unchanged[$i]; ##maxent score for unchanged seq
  $i++;
}

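@maxent_unchanged is the array you loaded all your data into, so it has twice as many elements as @NM and @max_score_unchanged. Your loop therefore runs twice as many times as it should, and for the second half of those iterations both $NM[$i] and $max_score_unchanged[$i] are undef. An undef hash key is stringified to the empty string, so all of the extra iterations write to the same '' key. That is why you see exactly one extra pair, '' => undef.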
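The overshoot can be reproduced with a minimal sketch. The arrays @keys and @vals and the hash %h are made-up stand-ins for @NM, @max_score_unchanged and %max_unchanged, not names from the question. The last step also shows one way to drop the spurious entry, using delete on the empty-string key.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for @NM and @max_score_unchanged
my @keys = ('seqA', 'seqB');
my @vals = ('0.77', '10.99');

my %h;
{
    # Silence the "uninitialized" warnings that the overshoot triggers
    no warnings 'uninitialized';

    # Deliberately loop twice as far as the arrays go, as in the question
    for my $i (0 .. 2 * @keys - 1) {
        $h{ $keys[$i] } = $vals[$i];
    }
}

my $count_with_bug = keys %h;   # two real keys plus the '' key
delete $h{''};                  # remove the spurious '' => undef pair
my $count_fixed = keys %h;      # only the real keys remain

print "$count_with_bug $count_fixed\n";
```

Run as-is, this prints "3 2". Once the loop bound is fixed the delete is unnecessary, since the empty key is never created in the first place.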

If you add use strict; and use warnings; to your script, you'll see these warnings when you run it:

Use of uninitialized value within @NM in hash element at test.pl line 25, <DATA> line 4.
Use of uninitialized value within @NM in hash element at test.pl line 25, <DATA> line 4.

These point you to the right line. You could also add print "$i\n"; to that loop to see how many times it runs, and compare that to the number of elements in @NM and @max_score_unchanged.
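A sanity check along those lines might look like the sketch below. The @lines array is made-up sample data standing in for the contents of the file, and the die message is only illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample: label lines alternate with score lines
my @lines = ('seqA', '0.77', 'seqB', '10.99');

my (@NM, @max_score_unchanged);
for my $line (@lines) {
    if ($line =~ /[a-z]/i) { push @NM, $line }
    else                   { push @max_score_unchanged, $line }
}

# An array evaluated in scalar context gives its element count
my $n_lines  = scalar @lines;
my $n_keys   = scalar @NM;
my $n_scores = scalar @max_score_unchanged;

# Stop early if the two arrays cannot be paired up one-to-one
die "key/score count mismatch: $n_keys vs $n_scores\n"
    unless $n_keys == $n_scores;

print "lines=$n_lines keys=$n_keys scores=$n_scores\n";
```

Seeing that the line count is double the other two makes it clear which array the hash-building loop should be bounded by.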

I recommend you use proper indentation in your code to make it much more readable.


Example:

use strict;
use warnings;
use Data::Dumper;

my @maxent_unchanged = <DATA>;
chomp (@maxent_unchanged);

my @NM;
my @max_score_unchanged;

foreach my $line(@maxent_unchanged) {
    if ($line =~ m/[a-z]/i) {
        push (@NM, $line);
    }
    else { 
        push (@max_score_unchanged, $line);
    }
}

my %max_unchanged;
for (my $i = 0; $i < @max_score_unchanged; $i++ ) {
    $max_unchanged{$NM[$i]} = $max_score_unchanged[$i]; ##maxent score for unchanged seq
}

print Dumper \%max_unchanged;

__DATA__
TTAAGGCAGCCCACCCGCAGGCT        >       1       110740688       110740688       C       T       GCCTGGGCGGGGAGGGCTGTCACAGTGCCGGCAGCAGCCCTTAAGGCAGC[C]CACCCGCAGGCTGCCGAGCGCTACCTGTATTTCCCCAACTGGGCCATGGC splicing  splicing        SLC6A17:NM_001010898:exon12:c.1816-10C>T
0.77
TTCTATCCTTTGTTTTACAGGAA        >       1       111857154       111857154       T       C       TTAAATGGAGGGAGTCCTGACTTTTGAAGTTTATCTGTTTCTATCCTTTG[T]TTTACAGGAACAGCCAGCTGAAAACTCTCCTGGCCATTGGAGGCTGGAAC splicing  splicing        CHIA:NM_201653:exon5:c.258-8T>C
10.99

I also showed how you can iterate by index with a C-style for loop instead of a foreach loop, since you don't use $lines anywhere.


Output:

$VAR1 = {
          'TTAAGGCAGCCCACCCGCAGGCT        >       1       110740688       110740688       C       T       GCCTGGGCGGGGAGGGCTGTCACAGTGCCGGCAGCAGCCCTTAAGGCAGC[C]CACCCGCAGGCTGCCGAGCGCTACCTGTATTTCCCCAACTGGGCCATGGC splicing  splicing        SLC6A17:NM_001010898:exon12:c.1816-10C>T' => '0.77',
          'TTCTATCCTTTGTTTTACAGGAA        >       1       111857154       111857154       T       C       TTAAATGGAGGGAGTCCTGACTTTTGAAGTTTATCTGTTTCTATCCTTTG[T]TTTACAGGAACAGCCAGCTGAAAACTCTCCTGGCCATTGGAGGCTGGAAC splicing  splicing        CHIA:NM_201653:exon5:c.258-8T>C' => '10.99'
        };

1
votes

Do you really need to copy the data into multiple arrays? Are they being used elsewhere in the script? If not, then I'd simply build the hash as I loop over the filehandle.

use strict;
use warnings;
use Data::Dumper;

my %max_unchanged;

while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ /^[ACGT]/) {
        chomp(my $value = <DATA>);
        $max_unchanged{$line} = $value;
    }
}

print Dumper \%max_unchanged;

__DATA__
TTAAGGCAGCCCACCCGCAGGCT        >       1       110740688       110740688       C       T       GCCTGGGCGGGGAGGGCTGTCACAGTGCCGGCAGCAGCCCTTAAGGCAGC[C]CACCCGCAGGCTGCCGAGCGCTACCTGTATTTCCCCAACTGGGCCATGGC splicing  splicing        SLC6A17:NM_001010898:exon12:c.1816-10C>T
0.77
TTCTATCCTTTGTTTTACAGGAA        >       1       111857154       111857154       T       C       TTAAATGGAGGGAGTCCTGACTTTTGAAGTTTATCTGTTTCTATCCTTTG[T]TTTACAGGAACAGCCAGCTGAAAACTCTCCTGGCCATTGGAGGCTGGAAC splicing  splicing        CHIA:NM_201653:exon5:c.258-8T>C
10.99
1
votes

Matt has correctly pointed out the reason for your problem. In fact it would be better in this instance to iterate over a list of indices, like this

my %max_unchanged;
for my $i (0 .. $#max_score_unchanged) {
  $max_unchanged{$NM[$i]} = $max_score_unchanged[$i];
}

or you could even use map, like this

my %max_unchanged = map {
  $NM[$_] => $max_score_unchanged[$_];
} 0 .. $#max_score_unchanged;
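A hash slice is another option that avoids the index altogether. This is a minimal sketch with made-up two-element arrays standing in for the real data; only the slice assignment itself is the point.

```perl
use strict;
use warnings;

# Hypothetical sample data in place of the arrays built from the file
my @NM                  = ('seqA', 'seqB');
my @max_score_unchanged = ('0.77', '10.99');

# Assign all values to all keys in one statement, pairing them by position
my %max_unchanged;
@max_unchanged{@NM} = @max_score_unchanged;
```

This assumes both arrays have the same length. If @NM were longer, the surplus keys would be created with undef values.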

But in the end there is no clear reason to have split your file into two arrays, and you may prefer this more concise version of your program which achieves the same end. It expects the input file as a parameter on the command line.

use strict;
use warnings;

my %max_unchanged;
while (my $key = <>) {
  next unless $key =~ /[a-z]/;
  chomp $key;
  chomp($max_unchanged{$key} = <>);  # read the score line from the same input as the key
}

use Data::Dump;
dd \%max_unchanged;

Given your sample input data, %max_unchanged ends up looking like this

{
  "TTAAGGCAGCCCACCCGCAGGCT        >       1       110740688       110740688       C       T       GCCTGGGCGGGGAGGGCTGTCACAGTGCCGGCAGCAGCCCTTAAGGCAGC[C]CACCCGCAGGCTGCCGAGCGCTACCTGTATTTCCCCAACTGGGCCATGGC splicing  splicing        SLC6A17:NM_001010898:exon12:c.1816-10C>T" => 0.77,
  "TTCTATCCTTTGTTTTACAGGAA        >       1       111857154       111857154       T       C       TTAAATGGAGGGAGTCCTGACTTTTGAAGTTTATCTGTTTCTATCCTTTG[T]TTTACAGGAACAGCCAGCTGAAAACTCTCCTGGCCATTGGAGGCTGGAAC splicing  splicing        CHIA:NM_201653:exon5:c.258-8T>C"          => 10.99,
}