3
votes

My goal is to run some regex-based, line-oriented processing on the output of a process. Since I've already got a bunch of tools in Perl, I decided to use Perl to solve my problem.

Let's say the process outputs a large file, for example:

cat LARGEFILE.txt | grep "A String"

Obviously the process I want to call is not "cat" but something that outputs a bunch of lines (typically 100 GB of data).

I had doubts about the performance of my Perl program, so I started stripping the code down to a minimum. I realized that the problem might come from the way I read the command's output in Perl.

Here's my Perl script:

#!/usr/bin/perl

use strict;

open my $fh, "cat LARGE.txt |";
while (<$fh>) {
        print $_ if $_ =~ qr/REGEX NOT TO BE FOUND/o;
}

I decided to compare my program with a simple bash command:

cat LARGE.txt | grep "REGEX NOT TO BE FOUND"

Results:

time cat LARGE.txt | grep "REGEX NOT TO BE FOUND"
real    0m0.615s
user    0m0.352s
sys     0m0.873s

time ./test.pl 

real    0m37.339s
user    0m36.621s
sys     0m1.766s

In my example, LARGE.txt file is about 1.3GB.

I understand that the Perl solution might be slower than the cat | grep example, but I was not expecting such a big difference.

Is there something wrong with my way of reading the output of a command?

P.S. I'm using perl v5.10.1 on a Linux box

Could you replace open my $fh, "cat LARGE.txt |"; with open my $fh, '<', 'LARGE.txt'; and try again? - Lee Duhem
@LeeDuhem I did that and the result is similar (36 seconds). However, I really need to get this from the output of a process, not from a file. I can't use temporary files. I guess I could use named pipes, but I'd rather not, as it adds complexity to my processes. - Tony
Well, I guess you need to do some profiling to find out the performance bottleneck in your case. - Lee Duhem
I am very suspicious about the times you are getting for grep. A hard disk average read speed rarely exceeds 150MB/s, meaning it should take 15s just to read a 2GB file. - Borodin
I get a ~2x speed increase by precompiling the regex: my $regex = qr/REGEX NOT TO BE FOUND/o; and in the loop, print $_ if $_ =~ $regex;. - Oktalist
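A minimal sketch of the change Oktalist describes: compile the pattern once, outside the loop, and reuse the compiled object. The `/o` in the original doesn't help, because `qr//` written inside the loop body still builds a new regex object on every iteration. The sample lines and the `needle` pattern below are made up for illustration; in the real script the lines would come from the pipe handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile the pattern once, before the loop.
my $regex = qr/needle/;

# Stand-in for lines read from the process, e.g.
#   open my $fh, '-|', 'somecommand' or die "can't run: $!";
#   while (my $line = <$fh>) { ... }
my @lines = ("hay\n", "needle in haystack\n", "more hay\n");

# Reuse the precompiled object on every line.
my @hits = grep { $_ =~ $regex } @lines;
print @hits;
```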

1 Answer

1
votes

You could try out sysread:

(stolen from: http://www.perlmonks.org/?node_id=457046)

use warnings;
use strict;

my $filename = "test.txt";

die "filename not found\n" unless -f $filename;

my $size = -s $filename;

open my $fh, "<", $filename or die "can't open $filename: $!\n";
binmode($fh);

my $bufsize = 8192; # typical size for i/o buffers
my ( $databuf, $readbuf, $nread );
while (( $nread = sysread( $fh, $readbuf, $bufsize )) > 0 ) {
    $databuf .= $readbuf;
    process_lines_from_buffer(\$databuf);
}
# handle a final line that has no trailing newline
process_line(\$databuf) if defined $databuf && length $databuf;

print "initial size: $size\n";

sub process_lines_from_buffer {
    ### pass a reference so the buffer is not copied on every call
    return undef if ! defined ${$_[0]};
    while (${$_[0]} =~ s!(.*?)\n!!){
        ### do your processing
        process_line(\$1);
    }
}
sub process_line {
    print ${$_[0]}."\n";
}