How to parse a file, create records and perform manipulations on records including frequency of terms and distance calculations

Question

I'm a student in an intro Perl class, looking for suggestions and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages forums. I am not advanced with Perl subs or modules (including Bioperl) so please limit responses to an appropriate 'beginner level' so that I can understand and learn from your suggestions and/or code (also limit "Magic" please).

The requirements of the program are as follows:

Read a file (containing data about Atoms) from the command line & create an array of atom records (one record/atom per newline). For each record the program will need to store:

• The atom's serial number (cols 7 - 11)
• The three-letter name of the amino acid to which it belongs (cols 18 - 20)
• The atom's three coordinates (x,y,z) (cols 31 - 54 )
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )

Prompt for one of three commands: freq, length, density d (d is some number):

• freq - how many of each type of atom is in the file (example Nitrogen, Sodium, etc would be displayed like this: N: 918 S: 23
• length - The distances among coordinates
• density d (where d is a number) - program will prompt for the name of a file to save computations to and will containing the distance between that atom and every other atom. If that distance is less than or equal to the number d, it increments the count of the number of atoms that are within that distance, unless that count is zero into the file. The output will look something like:
1: 5
2: 3
3: 6
... (very big file) and will close when it finishes.

I'm looking for feedback on what I have written (and need to write) in the code below. I especially appreciate any feedback in how to approach writing my subs. I've included sample input data at the bottom.

The program structure and function descriptions as I see it:

$^W = 1; # turn on warnings
use strict; # behave!

my @fields;
my @recs;

while ( <DATA> ) {
 chomp;
 @fields = split(/\s+/);
 push @recs, makeRecord(@fields);
}

for (my $i = 0; $i < @recs; $i++) {
 printRec( $recs[$i] );
}
    my %command_table = (
 freq => \&freq,
 length => \&length,
 density => \&density,
 help => \&help, 
 quit => \&quit
 );

print "Enter a command: ";
while ( <STDIN> ) {
 chomp; 
 my @line = split( /\s+/);
 my $command = shift @line;
 if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
    print "Command must be: freq, length, density or quit\n";
    }
  else {
    $command_table{$command}->();
    }
 print "Enter a command: ";
 }

sub makeRecord 
    # Read the entire line and make records from the lines that contain the 
    # word ATOM or HETATM in the first column. Not sure how to do this:
{
 my %record = 
 (
 serialnumber => shift,
 aminoacid => shift,
 coordinates => shift,
 element  => [ @_ ]
 );
 return\%record;
}

sub freq
    # take an array of atom records, return a hash whose keys are 
    # distinct atom names and whose values are the frequences of
    # these atoms in the array.  

sub length
    # take an array of atom records and return the max distance 
    # between all pairs of atoms in that array. My instructor
    # advised this would be constructed as a for loop inside a for loop. 

sub density
    # take an array of atom records and a number d and will return a
    # hash whose keys are atom serial numbers and whose values are 
    # the number of atoms within that distance from the atom with that
    # serial number. 

sub help
{
    print "To use this program, type either\n",
          "freq\n",
          "length\n",
          "density followed by a number, d,\n",
          "help\n",
          "quit\n";
}

sub quit
{
 exit 0;
}

# truncating for testing purposes. Actual data is aprox. 100 columns 
# and starts with ATOM or HETATM.
__DATA__
ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C  
ATOM   4744  CD  GLN A 704      19.589  30.757  55.525  1.00 73.28           C  
ATOM   4745  OE1 GLN A 704      18.801  29.892  55.098  1.00 75.91           O

+1 for asking how to improve a good effort instead of asking SO to make that effort for you. Sadly, lately on StackOverflow that has been a rarity. BTW, like FM said, your Perl level seems to be not as low as you seem to believe, if all of the above code was written by you or at least you understand 100% of it. — DVK
Thanks DVK. The code above took me a hours to write and get working. Would probably take someone who knows what their doing an hour. FM's telling me I'm advancing made my night! I always think I'm doing it all wrong. The SO community is so encouraging and supportive. I can't imagine learning on my own. Thanks for the +1. Can't wait to be on the answering side of the community. — Koala

FMc FMc · Accepted Answer · 2010-12-04T01:00:16

It looks like your Perl skills are advancing nicely -- using references and complex data structures. Here are a few tips and pieces of general advice.

Enable warnings with use warnings rather than $^W = 1. The former is self-documenting and has the advantage being local to the enclosing block rather than being a global setting.
Use well-named variables, which will help document the program's behavior, rather than relying on Perl's special $_. For example:
```
while (my $input_record = <DATA>){
}
```
In user-input scenarios, an endless loop provides a way to avoid repeated instructions like "Enter a command". See below.
Your regex can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified if-else structure below.
Enclose each part of program within its own subroutine. This is a good general practice for a bunch of reasons, so I would just start the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try to write the program so that it uses no global variables at all. Instead, any needed information would be passed around between the subroutines. With small programs one does not necessarily need to be rigid about the avoidance of globals, but it's not a bad idea to keep the ideal in mind.
Give your length subroutine a different name. That name is already used by the built-in length function.
Regarding your question about makeRecord, one approach is to ignore the filtering issue inside makeRecord. Instead, makeRecord could include an additional hash field, and the filtering logic would reside elsewhere. For example:
```
my $record = makeRecord(@fields);
push @recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
```

An illustration of some of the points above:

use strict;
use warnings;

run();

sub run {
    my $atom_data = load_atom_data();
    print_records($atom_data);
    interact_with_user($atom_data);
}

...

sub interact_with_user {
    my $atom_data = shift;
    my %command_table = (...);

    while (1){
        print "Enter a command: ";
        chomp(my $reply = <STDIN>);

        my ($command, @line) = split /\s+/, $reply;

        if ( $command =~ /^(freq|density|length|help|quit)$/ ) {
            # Run the command.
        }
        else {
            # Print usage message for user.
        }
    }
}

...

How to parse a file, create records and perform manipulations on records including frequency of terms and distance calculations

2 Answers