
After long searches on the net, I decided to ask here about my problem. I have a set of CSV files (36 files in total) arriving every 5 minutes, and each file contains around 1.5 million lines. I need to process these files within 5 minutes. I have to parse them and create the required directories from them inside the storage zone. Each unique line is then translated into a file placed inside the related directory, and the related lines are written into the related files. As you can see, there are lots of I/O operations.

I can finish 12 files in around 10 minutes; the target is to finish all 36 in 5 minutes. I am using Perl for this. As far as I can see, the problem is the system calls for the I/O operations.

I want to control the file handles and the I/O buffering in Perl so that I do not have to write to a file on every single line. This is where I got lost, actually. Creating the directories also seems to consume too much time.

I searched CPAN and the web for some lead that could shed light on this, but no luck. Does anybody have a suggestion on this subject? What should I read, or how should I proceed? I believe Perl is more than capable of solving this, but I guess I am not using the right tools.

open(my $data,"<", $file);
my @lines = <$data>;

foreach (@lines) {
    chomp $_;
    my $line = $_;

    my @each = split(' ',$line);
    if (@each == 10) {
       my @logt = split('/',$each[3]);
       my $llg=1;

       if ($logt[1] == "200") {
           $llg = 9;
       }

       my $urln = new URI::URL $each[6];
       my $netl = $urln->netloc;

       my $flnm = md5_hex($netl);
       my $urlm = md5_hex($each[6]);

       if ( ! -d $outp."/".$flnm ) {
          mkdir $outp."/".$flnm,0644;
       }

       open(my $csvf,">>".$outp."/".$flnm."/".$time."_".$urlm) or die $!;
       print $csvf int($each[0]).";".$each[2].";".$llg."\n";
       close $csvf;   #--->> I want to get rid of this so I can use buffer      
    }
    else {
       print $badf $line;
    }

}

Assume that the code above is used inside a subroutine and is run in 12 threads; the parameter for it is the filename. I want to get rid of the close, because every time I open and close a file it makes a system I/O call, which causes the slowness. This is my assumption of course, and I am more than open to any suggestion.
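Something along these lines is what I have in mind, although I am not sure it is the right way (the %open_files cache below is just my own untested sketch, not part of the real script):

# keep output handles open in a hash instead of opening and closing per line
my %open_files;

sub get_handle {
    my ($path) = @_;
    unless ($open_files{$path}) {
        open my $fh, ">>", $path or die "Can't open $path: $!";
        $open_files{$path} = $fh;
    }
    return $open_files{$path};
}

# inside the loop, instead of open/print/close:
# print { get_handle($outp."/".$flnm."/".$time."_".$urlm) } int($each[0]).";".$each[2].";".$llg."\n";

# once, after all lines are processed:
close $_ for values %open_files;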

Thanks in advance.

Are you using system() to copy files and make directories? – Filippo Lauria
Have you parallelised it so each of your (likely) 4 cores gets 9 files maybe? Have you profiled your code using the excellent search.cpan.org/perldoc/Devel::NYTProf? – Mark Setchell
How long does it take for cat *csv > /dev/null? – mpapec
You are opening a file for append on every single one of the 54 million lines... – Mark Setchell
The reason for my previous question is that I am thinking you would be better off holding the data in memory and then writing to disk at the end. So, create a hash using the filename as the key and append your data to the end of the hash element rather than writing to a file. Then, when you get to the end, or you feel there is enough in memory, flush it all out to real disk files. – Mark Setchell

1 Answer


It seems possible that you'll open the same file multiple times. If that is so, it might be beneficial to collect the information in a data structure, and only write to the files after the loop has completed. This avoids testing for existence of the same directory repeatedly, and opens each output file only once.

We should also get rid of URI::URL – creating a new object during each loop iteration is too expensive considering your performance requirements. If your URLs all look like http://user:password@example.com/path/ or https://example.com/, we could use a simple regex instead.
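As a quick sanity check of that pattern (the URLs here are only illustrative, not taken from your data):

for my $url ('http://user:password@example.com/path/', 'https://example.com/') {
    my ($server) = $url =~ m{\A\w+://([^/]+)};
    print "$server\n";   # prints "user:password@example.com" and "example.com"
}

With that in place, the rewritten processing could look like this: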

open my $data, "<", $file or die "Can't open $file: $!";

my %entries;  # collect entries here during the loop

# only read one line at a time, don't keep unnecessary ballast around
while (my $line = <$data>) {
    chomp $line;

    my @each = split(' ',$line);

    if (@each != 10) {
        print $badf $line, "\n";   # restore the newline removed by chomp
        next;
    }

    my (undef, $logt) = split('/', $each[3]);
    my $llg = ($logt == 200) ? 9 : 1;

    my $url = $each[6];
    my ($server) = $url =~ m{\A\w+://([^/]+)};

    push @{ $entries{$server}{$url} }, sprintf "%d;%s;%d\n", $each[0], $each[2], $llg;
}

while (my ($dir, $files) = each %entries) {
    my $dir_hash = md5_hex($dir);
    my $dirname = "$outp/$dir_hash";

    # 0755 rather than 0644: a directory needs the execute bit set,
    # otherwise the open() below fails.
    mkdir $dirname, 0755 or die "Can't create $dirname: $!" unless -d $dirname;

    while (my ($file, $lines) = each %$files) {
        my $file_hash = md5_hex($file);
        my $filename = "$dirname/${time}_${file_hash}";

        open my $csv_fh, ">>", $filename or die "Can't open $filename: $!";
        print { $csv_fh } @$lines;
    }
}

I also cleaned up other aspects of the code (e.g. variable naming, error handling). I moved the call to md5_hex out of the main loop, but depending on the kind of data it may be better to not delay the hashing.
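For example, if hashing early turns out to be better (say, to keep very long URLs out of memory), the push in the main loop could store the digests right away; this is only a sketch of that variant, and it hashes once per line rather than once per unique key:

# inside the while loop, replacing the current push:
my $dir_hash  = md5_hex($server);
my $file_hash = md5_hex($url);
push @{ $entries{$dir_hash}{$file_hash} }, sprintf "%d;%s;%d\n", $each[0], $each[2], $llg;

The output loop then uses the keys as directory and file names directly, without calling md5_hex again.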