EDIT: Updated for some of the tougher cases.
This is hard to do in sed for several reasons! First, sed makes it hard to work on the standard multiline paragraphs we have in text. Another reason is that sed is not standardized across all platforms, so you never know what sorts of patterns or options it will support. So perhaps someone else can help you with that part.
But it is very easy to do in Perl.
use 5.10.0;
use strict;
use warnings;
my @texts = split /\R{2,}/, <<'END_OF_TEXT';
This is hard to do in sed for several reasons! First, sed makes it
hard to work on the standard multiline paragraphs we have in text.
Another reason is that sed is not standardized across all platforms,
so you never know what sorts of patterns or options it will support.
So perhaps someone else can help you with that part.

It was a dark and story night. Dr. Jones looked up
at the manor house with trepidation. Lightning
flashes could be seen both outside the house and
inside it, as St. Elmo's fire played across the lofty
spires. Mrs. Smith's fancy-dress party there on St. James's St.
was clearly going to be a lively one! Would anyone even notice
his mischief in time? Dr. Jones chortled with glee as he scampered
up the step.
END_OF_TEXT
my $sentence_rx = qr{
    (?: (?<= ^ ) | (?<= \s ) )  # after start-of-string or whitespace
    \p{Lu}                      # capital letter
    .*?                         # a bunch of anything
    (?<= \S )                   # that ends in non-whitespace
    (?<! \b [DMS]r )            # but isn't a common abbreviation
    (?<! \b Mrs )
    (?<! \b Sra )
    (?<! \b St )
    [.?!]                       # followed by a sentence ender
    (?= $ | \s )                # in front of end-of-string or whitespace
}sx;
for my $paragraph (@texts) {
    say "NEW PARAGRAPH";
    say "Looking for each sentence.";
    my $count = 0;
    while ($paragraph =~ /($sentence_rx)/g) {
        printf "\tgot sentence %d: <%s>\n", ++$count, $1;
    }
    say "\nLooking for exactly two sentences.";
    if ($paragraph =~ / ^ ( (?: $sentence_rx \s*? ){2} ) /x) {
        say "\tgot two sentences: <<$1>>";
    }
    print "\n";
}
When run, that produces this output:
NEW PARAGRAPH
Looking for each sentence.
got sentence 1: <This is hard to do in sed for several reasons!>
got sentence 2: <First, sed makes it
hard to work on the standard multiline paragraphs we have in text.>
got sentence 3: <Another reason is that sed is not standardized across all platforms,
so you never know what sorts of patterns or options it will support.>
got sentence 4: <So perhaps someone else can help you with that part.>
Looking for exactly two sentences.
got two sentences: <<This is hard to do in sed for several reasons! First, sed makes it
hard to work on the standard multiline paragraphs we have in text.>>
NEW PARAGRAPH
Looking for each sentence.
got sentence 1: <It was a dark and story night.>
got sentence 2: <Dr. Jones looked up
at the manor house with trepidation.>
got sentence 3: <Lightning
flashes could be seen both outside the house and
inside it, as St. Elmo's fire played across the lofty
spires.>
got sentence 4: <Mrs. Smith's fancy-dress party there on St. James's St.
was clearly going to be a lively one!>
got sentence 5: <Would anyone even notice
his mischief in time?>
got sentence 6: <Dr. Jones chortled with glee as he scampered
up the step.>
Looking for exactly two sentences.
got two sentences: <<It was a dark and story night. Dr. Jones looked up
at the manor house with trepidation.>>
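If you want to reuse this outside the demo loop, one way is to wrap the same pattern in a small function. The following is only a sketch: first_sentences() is a made-up helper name that does not appear in the script above, and the abbreviation list is the same short, incomplete one, so it will still misjudge abbreviations it does not know about.

```perl
use strict;
use warnings;

# Same pattern as in the script above; the abbreviation list is
# illustrative and far from complete.
my $sentence_rx = qr{
    (?: (?<= ^ ) | (?<= \s ) )  # after start-of-string or whitespace
    \p{Lu}                      # capital letter
    .*?                         # a bunch of anything
    (?<= \S )                   # that ends in non-whitespace
    (?<! \b [DMS]r )            # but isn't a common abbreviation
    (?<! \b Mrs )
    (?<! \b Sra )
    (?<! \b St )
    [.?!]                       # followed by a sentence ender
    (?= $ | \s )                # in front of end-of-string or whitespace
}sx;

# Hypothetical helper, not part of the original script.
# Returns the first $n sentences of $text joined by single spaces,
# or undef if $text holds fewer than $n sentences.
sub first_sentences {
    my ($text, $n) = @_;
    my @found;
    while ($text =~ /($sentence_rx)/g) {
        push @found, $1;
        last if @found == $n;   # stop as soon as we have enough
    }
    return @found == $n ? join(" ", @found) : undef;
}

my $text = "Dr. Jones looked up. Mrs. Smith waved! Nobody else noticed.";
print first_sentences($text, 2), "\n";
```

Collecting matches in a loop rather than using the {2} quantifier makes the count a parameter, at the cost of normalizing the whitespace between sentences to a single space.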
Hope this helps. Every time I try to do this in sed, it becomes very complicated.
Certainly you can only go so far in sed, and I virtually always need to go further than it allows me to go. If nothing else, I need a reliable way to know what flavor of regexes and switches will be supported, and you cannot do that portably with sed. Writing portable shell scripts is VERY, VERY much more difficult than people often think. I run on these operating systems:
- OpenBSD
- Darwin (that means Macs)
- Linux
- Solaris
- AIX
The Greatest Common Factor between all those is so tiny, you can never get anything interesting done — at least, not portably — with the shell tools. It really is very frustrating. It’s amazing what contortions Perl’s Configure shell script has to go through.