0
votes

I've struggled with regular expressions in Perl from the start. I wrote a quick script to count the sentences in some input text, but it doesn't work: I just get the number 1 back at the end, and I know the specified file contains several sentences, so the count should be higher. I can't see the issue...

#!C:\strawberry\perl\bin\perl.exe

#strict
#diagnostics
#warnings

$count = 0;
$file = "c:/programs/lorem.txt";

open(IN, "<$file") || die "Sorry, the file failed to open: $!";


while($line = <IN>)
{
    if($line =~ m/^[A-Z]/)
    {
        $count++;
    }
}

close(IN);

print("Sentances count was: ($count)");

The file lorem.txt is here:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. Phasellus viverra nulla ut metus varius laoreet. Quisque rutrum. Aenean imperdiet. Etiam ultricies nisi vel augue. Curabitur ullamcorper ultricies nisi. Nam eget dui. Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem. Maecenas nec odio et ante tincidunt tempus. Donec vitae sapien ut libero venenatis faucibus. Nullam quis ante. Etiam sit amet orci eget eros faucibus tincidunt. Duis leo. Sed fringilla mauris sit amet nibh. Donec sodales sagittis magna. Sed consequat, leo eget bibendum sodales, augue velit cursus nunc,

3
Are you making the assumption that every line is a sentence? I think your regex needs to be a tad broader if you're counting sentences. – Brad Christie
We need to see at least a couple of lines from your file that you think should be counted... – ysth
Also, I'm not as familiar with Perl's regex, but this works with my tests (albeit primitive: it finds sentences, then you just have to count the matches): [A-Z].*?[.!?]\s*(?=[A-Z]|[\r\n]+|$) [using the global flag] – Brad Christie
strict, diagnostics and warnings don't help unless they are written as "use strict;" and then either "use diagnostics;" or "use warnings;". – Eric Strom

3 Answers

2
votes

I don't know what's in your lorem.txt, but the code that you've given is not counting sentences. It's counting lines, and only those lines that begin with a capital letter.

This regex:

/^[A-Z]/

will only match at the beginning of a line, and only if the first character on that line is a capital letter. So a line that looks like "it. And then we went..." will not be matched.

If you want to match all capital letters, just remove the ^ from the beginning of the regex.
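Note that inside an if, the pattern still succeeds at most once per line, so counting every occurrence needs the /g flag. Here is a minimal sketch of that, assuming a sentence can be approximated as a capital letter that either starts the text or follows ., ! or ? plus whitespace. The helper name count_sentences and the sample string are made up for illustration, and abbreviations such as "Dr. Smith" would be miscounted.

```perl
use strict;
use warnings;

# Hypothetical helper: counts approximate sentence starts in a string.
sub count_sentences {
    my ($text) = @_;
    # Match a capital letter at the very start of the text, or one that
    # follows ., ! or ? and some whitespace. With /g in list context the
    # match returns every hit; the "= () =" idiom turns that list into a count.
    my $count = () = $text =~ /(?:\A\s*|[.!?]\s+)[A-Z]/g;
    return $count;
}

my $sample = "Aenean massa. Cum sociis natoque. Donec quam felis, ultricies nec.";
print "Sentence count was: ", count_sentences($sample), "\n";
```

To run it on a file, the whole file would need to be read into one string first, since a sentence may span several lines.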

2
votes

This does not answer your specific question about regexp, but you could consider using a CPAN module: Text::Sentence. You can look at its source code to see how it defines a sentence.

use warnings;
use strict;
use Data::Dumper;
use Text::Sentence qw(split_sentences);

my $text = <<EOF;
One sentence.  Here is another.
And yet another.
EOF

my @sentences = split_sentences($text);
print Dumper(\@sentences);

__END__

$VAR1 = [
          'One sentence.',
          'Here is another.',
          'And yet another.'
        ];

A Google search also turned up Lingua::EN::Sentence.

1
votes

You are currently counting all lines that begin with a capital letter. Perhaps you intend to count all words that start with a capital letter? If so, try:

m/\W[A-Z]/

(This is not a robust count of sentences, though.)

On another note, there is no need to do the file manipulation explicitly. Perl does a really good job of that for you. Try this:


$ARGV[ 0 ] = "c:/programs/lorem.txt" unless @ARGV;
while( $line = <> ) {
...

If you do insist on an explicit open/close, using bareword filehandles is considered bad practice. In other words, instead of "open IN...", write "open my $fh, '<', $file_name;".
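
A sketch of the lexical-filehandle form, assuming the file is read line by line as in the question. The helper name count_matching_lines is hypothetical; the path is the one from the question, and the call is skipped if that file is not present.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the lines in $file_name that match $pattern.
sub count_matching_lines {
    my ($file_name, $pattern) = @_;
    # Three-argument open with a lexical handle. "or" has lower precedence
    # than "||", so the die applies to the whole open call.
    open my $fh, '<', $file_name
        or die "Sorry, the file failed to open: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ $pattern;
    }
    close $fh;
    return $count;
}

# Path taken from the question; only run when the file actually exists.
my $file = "c:/programs/lorem.txt";
print "Lines starting with a capital: ",
    count_matching_lines($file, qr/^[A-Z]/), "\n"
    if -e $file;
```

The lexical $fh is scoped to the subroutine, so it cannot clash with a handle of the same name elsewhere in the program.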