How to replicate cat/sort/uniq in native Perl code?

2 votes

I'm building on the knowledge shared in a previous question: What native Perl code replaces `cut`?

A Perl script uses this code:

my $cmd = "cat $TMPDIR/files.* | sort | uniq > $File";
`$cmd`;

I'm trying to reimplement the above in native Perl so that it runs on MS Windows. I have this so far, but it's not quite working:

my $globPat = "$TMPDIR/parts.*";
my $outFile = "$TMPDIR/out.txt";
my %lines;

# 1) glob all files
while (my $glob = glob($globPat)) {
    open(IN, "<", "$glob") or die("Can't read $glob");
    # collect lines as unique keys in a hash
    ++$lines{ ($_)[1] } while <IN>;
    close(IN);
}

# sort the key and save values to $glueFile
open(OUT, ">", "$outFile") or die("ERROR: Can't write $outFile");
foreach my $key (sort keys %lines) {
    print OUT $lines{$key} . "\n";
}
close(OUT);

I'm getting a variety of errors whose line numbers shift around as I try to troubleshoot. Can someone help me sort out 1) how to use glob properly, 2) how to add the lines read from the various files as hash keys, and 3) how to sort the hash's keys (the lines) and print them to the new output file?

perl hash

($_)[1] .... ? Why would that work when you loop over a list of scalar values? – TLP

metacpan.org/pod/PerlPowerTools – ysth

2 votes

You can do it with a one-liner that collects the lines as hash keys and sorts them in an END block:

perl -ne '$h{ $_ } = 1; END { print sort keys %h }' $TMPDIR/files.*
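On Windows, cmd.exe does not treat single quotes as quoting characters and does not expand wildcards, so the one-liner may need adapting there. Here is a minimal script-form sketch of the same idea, assuming the file patterns arrive as command-line arguments (the sub name sorted_unique_lines is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the sorted, de-duplicated lines
# of all files matching the given patterns.
sub sorted_unique_lines {
    my @patterns = @_;
    my %seen;

    # Expand the wildcards ourselves instead of relying on the shell.
    for my $file ( map { glob($_) } @patterns ) {
        open my $fh, '<', $file or die qq{Can't read "$file": $!};
        $seen{$_} = 1 while <$fh>;
    }

    return sort keys %seen;
}

my @result = sorted_unique_lines(@ARGV);
print @result;
```

It could be run as something like `perl sortuniq.pl parts.* > out.txt`, where sortuniq.pl is a hypothetical file name.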

2 votes

List::MoreUtils::uniq can do the work of the command with the same name. For cat, I would simply use <>, although you should know that what you have there is a "useless use of cat". Sort is sort.

use strict;
use warnings;
use List::MoreUtils qw(uniq);

my @list = uniq(<>);
my @sorted = sort @list;

print @sorted;

Note that you do not have to add a newline to the lines, because they already have one.

If you do not wish to use the module, the code for uniq is fairly simple and can just be copy/pasted.

sub uniq {
    my %seen;
    grep { not $seen{$_}++ } @_;
}
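A quick way to check the hand-rolled version, using a few made-up sample lines: it keeps the first occurrence of each line in input order, so sorting is still a separate step.

```perl
use strict;
use warnings;

# Same idea as the sub above: keep an item only the first time it is seen.
sub uniq {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Made-up sample input; each line keeps its trailing newline.
my @lines  = ("pear\n", "apple\n", "pear\n", "fig\n", "apple\n");
my @unique = uniq(@lines);    # pear, apple, fig (order of first appearance)
my @sorted = sort @unique;    # apple, fig, pear

print @sorted;
```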

2 votes

There are a couple of problems with your code:

I assume you have extrapolated the expression ++$lines{ ($_)[1] } from something like ++$lines{ (split)[1] }. But there is a difference, because split returns a list of fields, whereas ($_)[1] attempts to extract the second element from a one-element list. You want simply ++$lines{$_}
In print OUT $lines{$key} you are printing the values of the hash %lines. But the hash is being used simply as a device to create a unique list, and the values are just the count of the times each line appears in the files. You want the keys instead, so print OUT $key is what you need (each key still ends with its newline unless you chomp the line first, in which case print OUT $key, "\n" is correct)
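To see both points in isolation, here is a small sketch (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;

my ( $from_split, $from_list );
for ('alpha beta gamma') {
    $from_split = (split)[1];    # second whitespace-separated field of $_
    $from_list  = ($_)[1];       # second element of a one-element list
}
print "split: $from_split\n";                                       # split: beta
print 'list:  ', defined $from_list ? $from_list : 'undef', "\n";   # list:  undef

# The keys hold the distinct lines; the values only hold the counts.
my %lines;
++$lines{$_} for ( "b\n", "a\n", "b\n" );
my @unique_sorted = sort keys %lines;                  # ("a\n", "b\n")
my @counts        = map { $lines{$_} } @unique_sorted; # (1, 2)

print @unique_sorted;
```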

There are also a few instances of bad practice which don't stop your program working but should be fixed anyway.

Local variables should use only lower case letters, numbers, and underscores. Capital letters are reserved for global identifiers
You should use lexical file handles, such as open my $in_fh, ... instead of open IN, .... Global variables are a bad idea in general, and it also obviates the need to close a file handle at the end of its scope as it will happen automatically
You should always put $! into the die string when an I/O operation has failed. It is often adequate to use just die $!, as the output includes the source file name and line number
It is best to use catfile from File::Spec::Functions rather than just using string concatenation. It handles things like multiple path separators properly and is also clearer to read
You shouldn't put quotes around a bare variable. So, for instance, open(IN, "<", "$glob") should be open(IN, "<", $glob). Adding quotes will make no difference at best, and at worst it will provide you with a completely different string
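A short sketch of three of these points, with made-up file names and variable names; the path that fails to open is assumed not to exist:

```perl
use strict;
use warnings;
use File::Spec::Functions 'catfile';

# Quoting a variable forces it to a string, which matters for references.
my $list_ref = [ 'a', 'b' ];
my $same     = $list_ref;      # still an array reference
my $changed  = "$list_ref";    # now a plain string such as "ARRAY(0x...)"
print ref($same), "\n";                                             # ARRAY
print ref($changed) eq '' ? "plain string\n" : "reference\n";       # plain string

# catfile joins path parts with the separator of the current platform:
# "tmp/out.txt" on Unix-like systems, "tmp\out.txt" on Windows.
my $path = catfile( 'tmp', 'out.txt' );
print "$path\n";

# $! holds the reason the operating system gave for the failure.
my $missing = 'no-such-dir/no-such-file.txt';    # hypothetical, assumed absent
my $error;
open( my $fh, '<', $missing ) or $error = qq{Can't read "$missing": $!};
print "$error\n" if defined $error;
```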

This is how I would refactor your program

use strict;
use warnings;

use File::Spec::Functions 'catfile';

my $temp_dir = '.';

my $glob_pat = catfile($temp_dir, 'parts.*');
my $out_file = catfile($temp_dir, 'out.txt');

my %lines;

while ( my $parts_file = glob($glob_pat) ) {
    open my $in_fh, '<', $parts_file or die qq{Can't read "$parts_file": $!};
    while ( my $line = <$in_fh> ) {
        chomp $line;    # drop the newline; it is added back on output
        ++$lines{$line};
    }
}

open my $out_fh, '>', $out_file or die qq{ERROR: Can't write to "$out_file": $!};
for my $line (sort keys %lines) {
    print $out_fh $line, "\n";
}

close $out_fh;

1 vote

You can also call glob in list context to get all the matching files at once:

my @files = glob("$TMPDIR/parts.*");
foreach my $file (@files)
{
    open my $fh, "<", $file or die "couldn't open '$file': $!";
    while (<$fh>)
    {
        #do whatever you want to do;
    }
}
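One thing to watch for, since temp directories on Windows often contain spaces: the built-in glob splits its argument on whitespace and treats each piece as a separate pattern. A minimal sketch of the difference, assuming bsd_glob from the core File::Glob module is acceptable (the directory and file names are made up and created in a throwaway temp directory):

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Build a throwaway directory whose name contains a space.
my $base = tempdir( CLEANUP => 1 );
my $dir  = "$base/my temp";
mkdir $dir or die qq{Can't create "$dir": $!};
for my $name (qw(parts.1 parts.2)) {
    open my $fh, '>', "$dir/$name" or die qq{Can't write "$dir/$name": $!};
    close $fh;
}

# The built-in glob treats the space as a separator between two patterns ...
my @split_at_space = glob("$dir/parts.*");

# ... whereas bsd_glob takes the whole string as a single pattern.
my @whole_pattern = bsd_glob("$dir/parts.*");

print scalar(@whole_pattern), " files matched\n";    # 2 files matched
```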
