1
votes

Looking for help on writing a Perl program that takes an input file and performs manipulations based on follow-up commands. I'm a beginning Perl student so please don't get too advance in suggestions. The structure that I have so far is a main program and 4 subs.

I'm having trouble with two parts:

  1. Writing the portion of the main segment that creates a unique record for each line from the input file (which is fixed width format). I think this should be done with substr but I don't know much more of how this should be structured. Unpack is beyond the scope of my learning so far.

  2. One of the functions called in the main program is a "distance" sub which will calculate distance between atoms. I'm thinking this should be a For Loop inside a For loop. Any thoughts on what approach I should take?

The records should store an array of atom records (one record/atom per newline):

• The atom's serial number, 5 digits. (cols 7 - 11)

• The three-letter name of the amino acid to which it belongs (cols 18 - 20)

• The atom's three coordinates real number as decimal & Orthogonal Coordinates (x,y,z) (cols 31 - 54 )
For X in Angstroms cols. 31-38
For Y in Angstroms cols. 39-46
For Z in Angstroms cols. 47-54

• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )

sub Distance # take an array of atom records and return the max distance
# between all pairs of atoms in that array. (cols 31-54)

Here is sample text from an input file.

# truncating for testing purposes. Actual data is aprox. 100 columns     
# and starts with ATOM or HETATM    
__DATA__   
ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C    
ATOM   4744  CD  GLN A 704      19.589  30.757  55.525  1.00 73.28           C    
ATOM   4745  OE1 GLN A 704      18.801  29.892  55.098  1.00 75.91           O  

Here is what I have so far for the main and sub for make records. I hate to be lame but I don't have anything to show for the Distance sub yet so don't worry about giving code, any suggestions on how to approach would be very appreciated.

use warnings;
use strict; 

my @fields;
my @recs;

while ( <DATA> ) {
chomp;
@fields = split(/\s+/);
push @recs, makeRecord(@fields);
}

for (my $i = 0; $i < @recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
  freq => \&freq,
  length => \&length,
  density => \&density,
  help => \&help, 
  quit => \&quit
);

print "Enter a command: ";
  while ( <STDIN> ) {
  chomp; 
  my @line = split( /\s+/);
  my $command = shift @line;
  if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
    print "Command must be: freq, length, density or quit\n";
  }
    else {
    $command_table{$command}->();
  }
print "Enter a command: ";
}

sub makeRecord 
# Read the entire line and make records from the lines that contain the 
# word ATOM or HETATM in the first column. Not sure how to do this:
{
 my %record = 
 (
 serialnumber => shift,
 aminoacid => shift,
 coordinates => shift,
 element  => [ @_ ]
 );
 return\%record;
 }
3
Why is unpack out of your scope of learning when you're making use of a dispatch table?Zaid
@Zaid My class just learned functions but we haven't learned Unpack.Koala
The one caveat with using unpack is that you have to get the template right. You are given the start and end columns for each field you have to read. The @ code jumps to the desired starting position, except it's zero-based, so to jump to column 7, for example, you can use @6, and then to jump to column 18, you can use @17. That way you don't have to count the spaces you are skipping.Narveson
The man page for unpack is at perldoc.perl.org/functions/unpack.html.Narveson
@Narveson thanks. Unpack seems straightforward enough. Nonetheless, my professor has asked for us to wait until next semester before getting into it.Koala

3 Answers

1
votes

There's Perl code available online for working with PDB files (which obviously you are doing). I'm not suggesting just using a module you downloaded and be done with it, as surely your instructor wouldn't approve, and you wouldn't learn that much ;) But you could take a look at some of the code that's offered and try to see whether some bits there address your problem.

I did a quick bit of googling, I saw that there's ParsePDB.pm (for example). You can find the web page here. I didn't have a look at the code or the functionality though, I'm just hoping there will be something in there that you may find helpful.

EDIT 1

Okay, it's 14 hours later now, and I felt like doing some coding, so as you have not yet accepted an answer I thought I could just ignore my own advice and draw up something (as you will notice I have copied Zaid's data structure)...

#!/usr/bin/perl

use warnings;
use strict;

sub makeRecord {
   my ($ser_num, $aa, $x, $y, $z, $element) = @_;
   # copying Zaid now as her/his structure looks very sensible!
   my $record = {
                  serial  => $ser_num,
                  aa      => $aa,
                  element => $element,
                  xyz     => [$x, $y, $z],
                };
   return $record;
}


my $file = shift @ARGV;
my @records; # will be an array of hash references

open FILE, "<$file" or die "$!";
while (<FILE>) {
   if (/^ATOM|^HETATM/) { # only get the structure data lines
      chomp; # not necessary here, but good practice I'd say

      my @fields = split; # by default 'split' splits on whitespace

      # now use an array slice to only pass the array elements
      # you're interested in (using the positional indices from @fields):
      push @records, makeRecord(@fields[1,3,6,7,8,11]);
   }
}
close FILE;

EDIT 2

Concerning the distance subroutine: the for loop inside the for loop should do the job, but this is the brute force way which might take quite a while (as you'd have to do (number_of_atoms)^2 calculations), depending on the size of your input molecule. For the purpose of your assignment the brute force approach is probably acceptable; in other cases you'd have to decide whether to favour ease of coding, or computational speed. If your instructor also wants you to keep the latter in mind, you could take a look at this page (I know you actually want the largest distance, and you're in 3D, not 2D...)

Ok, now I just hope that you managed to find some helpful bits and pieces in here :)

1
votes

It is strange that unpack is out of scope when I can see use of a dispatch table. It would be silly to overlook using unpack if fixed-format files are being processed. There is nothing 'advanced' going on in the code below:

use strict;
use warnings;
use Data::Dump 'dump';   # Use this if you want 'dump' function to work

my @records;
while ( my $record = <DATA> ) {

    next unless $record =~ /^ATOM|^HETATM/;  # Skip unwanted records

    # unpack minimizes the amount of work the code has to do ...
    # ... especially since you only want a small part of the file
    # 'x' tokens are ignored, 'A' tokens are read ...
    # The number following each token represents repetition count ...
    # ... so in this case the first 6 characters are ignored ...
    # ... and the next 5 are assigned to $serNo

    my ( $serNo, $aminoAcid, $xCoord, $yCoord, $zCoord )
        = unpack 'x6A5x6A3x10A10A10A10', $record;        # Get only what you want

    # Assign data to a hash reference

    my $recordStructure = {
                            serialnumber => $serNo,
                            aminoacid    => $aminoAcid,
                            coordinates  => [ $xCoord, $yCoord, $zCoord ],
                          };

    push @records, $recordStructure;  # Append current record
}

# 'dump' is really useful to view data structures. No need for PrintRec!!

dump @records;
1
votes

Your records have a fixed-width format, so use unpack to break each record into the fields of interest. Use the stated column positions of each field to construct a template for use with unpack.

my @field_specs = (
    {begin =>  7, end => 11, name => 'serialnumber'},
    {begin => 18, end => 20, name => 'aminoacid'},
    {begin => 31, end => 38, name => 'X'}, 
    {begin => 39, end => 46, name => 'Y'},
    {begin => 47, end => 54, name => 'Z'}, 
    {begin => 77, end => 78, name => 'element'},
);
my $unpack_template;    
my @col_names;
for my $spec (@field_specs) {
    my $offset = $spec->{begin} - 1;
    my $width  = $spec->{end} - $offset;
    $template .= "\@${offset}A$width";
    push @col_names, $spec->{name};
}
print "Ready to read @col_names\n using template $template ...\n";

# prints 
# Ready to read serialnumber aminoacid X Y Z element 
#  using template @6A5@17A3@30A8@38A8@46A8@76A2 ...

my @recs;
while ( <DATA> ) {                
    my %record;
    @record{@col_names} = unpack($unpack_template, $_);    
    push @recs, \%record;                
}