sed - How to get the first 2 sentences of a paragraph?

2

votes

Suppose I have a paragraph:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Using sed, how do I get a certain number of sentences (in this case 2), delimited by a period, extracting only the following text from the given paragraph?

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

regex bash command-line sed

I take it you're happy you won't end up with text like "It was a dark and story night. Dr. Jones looked up at the manor house with trepidation" – Gareth

3

votes

sed 's/\(^[^.]*\.[^.]*\.\)\(.*$\)/\1/g'

Explanation:

\( start group

^ match start of line

[^.]* match any number of non period characters

\. match period

[^.]* match any number of non period characters

\. match period

\) end group

\( start group

.*$ match everything up to end of line

\) end group

\1 Replace entire line with first group.
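As a quick check of the expression above, here is a minimal sketch that wraps it in a shell function and feeds it a short paragraph. The function name first_two_sentences is made up for illustration, and the sketch assumes the whole paragraph sits on a single line.

```shell
# Hypothetical helper (name invented for this example): keep only the
# first two period-terminated sentences of each input line.
# Assumes one paragraph per line.
first_two_sentences() {
    sed 's/\(^[^.]*\.[^.]*\.\)\(.*$\)/\1/g'
}

printf '%s\n' 'One fish. Two fish. Red fish. Blue fish.' | first_two_sentences
# prints: One fish. Two fish.
```

A line with fewer than two periods does not match the pattern and is printed unchanged.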

3

votes

EDIT: Updated for some of the tougher cases.

This is hard to do in sed for several reasons! First, sed makes it hard to work on the standard multiline paragraphs we have in text. Another reason is that sed is not standardized across all platforms, so you never know what sorts of patterns or options it will support. So perhaps someone else can help you with that part.

But it is very easy to do in Perl.

use 5.10.0;
use strict;
use warnings;

my @texts = split /\R{2,}/, <<'END_OF_TEXT';
This is hard to do in sed for several reasons! First, sed makes it
hard to work on the standard multiline paragraphs we have in text.
Another reason is that sed is not standardized across all platforms,
so you never know what sorts of patterns or options it will support.
So perhaps someone else can help you with that part.

It was a dark and story night. Dr. Jones looked up
at the manor house with trepidation. Lightning
flashes could be seen both outside the house and
inside it, as St. Elmo's fire played across the lofty
spires. Mrs. Smith's fancy-dress party there on St. James's St.
was clearly going to be a lively one! Would anyone even notice
his mischief in time?  Dr. Jones chortled with glee as he scampered
up the step.
END_OF_TEXT


my $sentence_rx = qr{
    (?: (?<= ^ ) | (?<= \s ) )  # after start-of-string or whitespace
    \p{Lu}                      # capital letter
    .*?                         # a bunch of anything
    (?<= \S )                   # that ends in non-whitespace
    (?<! \b [DMS]r  )           # but isn't a common abbreviation
    (?<! \b Mrs )
    (?<! \b Sra )
    (?<! \b St  )
    [.?!]                       # followed by a sentence ender
    (?= $ | \s )                # in front of end-of-string or whitespace
}sx;

for my $paragraph (@texts) {
    say "NEW PARAGRAPH";
    say "Looking for each sentence.";

    my $count = 0;
    while ($paragraph =~ /($sentence_rx)/g) {
        printf "\tgot sentence %d: <%s>\n", ++$count, $1;
    }

    say "\nLooking for exactly two sentences.";

    if ($paragraph =~ / ^ ( (?: $sentence_rx \s*? ){2} ) /x) {
        say "\tgot two sentences: <<$1>>";
    }
    print "\n";
}

When run, that produces this output:

NEW PARAGRAPH
Looking for each sentence.
        got sentence 1: <This is hard to do in sed for several reasons!>
        got sentence 2: <First, sed makes it
hard to work on the standard multiline paragraphs we have in text.>
        got sentence 3: <Another reason is that sed is not standardized across all platforms,
so you never know what sorts of patterns or options it will support.>
        got sentence 4: <So perhaps someone else can help you with that part.>

Looking for exactly two sentences.
        got two sentences: <<This is hard to do in sed for several reasons! First, sed makes it
hard to work on the standard multiline paragraphs we have in text.>>

NEW PARAGRAPH
Looking for each sentence.
        got sentence 1: <It was a dark and story night.>
        got sentence 2: <Dr. Jones looked up 
at the manor house with trepidation.>
        got sentence 3: <Lightning
flashes could be seen both outside the house and
inside it, as St. Elmo's fire played across the lofty
spires.>
        got sentence 4: <Mrs. Smith's fancy-dress party there on St. James's St.
was clearly going to be a lively one!>
        got sentence 5: <Would anyone even notice
his mischief in time?>
        got sentence 6: <Dr. Jones chortled with glee as he scampered 
up the step.>

Looking for exactly two sentences.
        got two sentences: <<It was a dark and story night. Dr. Jones looked up 
at the manor house with trepidation.>>

Hope this helps. Every time I try to do this in sed, it becomes very complicated. Certainly you can only go so far in sed, and I virtually always need to go further than it allows me to go. If nothing else, I need a reliable way to know what flavor of regexes and switches will be supported, and you cannot do that portably with sed. Writing portable shell scripts is VERY, VERY much more difficult than people often think. I run on these operating systems:

OpenBSD
Darwin (that means Macs)
Linux
Solaris
AIX

The Greatest Common Factor between all those is so tiny, you can never get anything interesting done — at least, not portably — with the shell tools. It really is very frustrating. It’s amazing what contortions Perl’s Configure shell script has to go through.
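For comparison, the abbreviation list from the Perl pattern can be roughly approximated in sed by hiding those periods behind a placeholder before cutting and restoring them afterwards. This is only a sketch under several assumptions: the function name first_two_protected and the @DOT@ placeholder are invented here, the paragraph is on one line, the text never contains the literal string @DOT@, only periods end sentences (not ? or !), and no word-boundary check is made.

```shell
# Hypothetical helper: first two sentences of each line, treating the
# period after Dr/Mr/Sr/Mrs/Sra/St as part of the word.
# @DOT@ is an arbitrary placeholder assumed not to occur in the input.
first_two_protected() {
    sed -e 's/\([DMS]r\)\./\1@DOT@/g' \
        -e 's/\(Mrs\)\./\1@DOT@/g' \
        -e 's/\(Sra\)\./\1@DOT@/g' \
        -e 's/\(St\)\./\1@DOT@/g' \
        -e 's/^\(\([^.]*\.\)\{2\}\).*/\1/' \
        -e 's/@DOT@/./g'
}

printf '%s\n' 'It was a dark night. Dr. Jones looked up. Lightning flashed.' |
    first_two_protected
# prints: It was a dark night. Dr. Jones looked up.
```

A sentence that genuinely ends in one of the listed abbreviations is glued to the next one, which is the kind of corner case that makes the sed route grow complicated.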

2

votes

You can use awk

 awk -vRS="." 'NR<=2' ORS="." file

Set both input/output record separator to ".", then print first and second record (NR<=2). If your sentences do not have arbitrary dots as in Mr. James, then the above should be sufficient for your needs without having to craft complex regular expressions.
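A small sketch of the same idea with the record count as a parameter, reading from standard input. The function name first_n_records is hypothetical. Both separators are passed with -v here so that they are set before any input is read. The second call shows the abbreviation caveat mentioned above.

```shell
# Hypothetical helper: print the first N period-delimited records.
# RS splits input on "."; ORS puts the "." back after each record.
first_n_records() {
    awk -v RS='.' -v ORS='.' -v n="$1" 'NR <= n'
}

printf '%s\n' 'One fish. Two fish. Red fish.' | first_n_records 2
echo    # awk leaves no trailing newline, so add one for display
# prints: One fish. Two fish.

printf '%s\n' 'It was dark. Dr. Jones looked up. He left.' | first_n_records 2
echo
# prints: It was dark. Dr.
```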

2

votes

This might work for you:

 sed 's/\(\.[^.]*\.\).*/\1/' file

Provided each paragraph is on a separate line.

This might work across newlines:

echo -e "a b c.\nx y z.\na b c" | sed ':a;$!N;/\(\.[^.]*\.\).*/!{$!ba};s//\1/;q'       
a b c.
x y z.
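If keeping the original line breaks does not matter, a simpler route is to join the lines first and then use a one-line expression. The sketch below assumes the input holds a single paragraph. The function name first_two_joined is invented for this example, and unlike the command above it replaces the newlines with spaces.

```shell
# Hypothetical helper: join all input lines with spaces, then keep the
# first two period-terminated sentences. Assumes the input is one paragraph.
first_two_joined() {
    paste -s -d ' ' - | sed 's/^\(\([^.]*\.\)\{2\}\).*/\1/'
}

printf 'a b c.\nx y z.\na b c\n' | first_two_joined
# prints: a b c. x y z.
```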

1

votes

This will work for your example:

sed 's/^\(\([^.]*\.\)\{2\}\).*/\1/'

or:

sed -r 's/^(([^.]*\.){2}).*/\1/'
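Since the question asks for "a certain number" of sentences, the interval count can be turned into a parameter. This is a minimal sketch: the function name first_n_sentences is hypothetical, and it still assumes one paragraph per line with sentences ending only in a period.

```shell
# Hypothetical helper: keep the first N period-terminated sentences of
# each input line. N must be a non-negative integer.
first_n_sentences() {
    n=$1
    case $n in
        ''|*[!0-9]*) echo "usage: first_n_sentences N" >&2; return 1 ;;
    esac
    # Double quotes let the shell expand $n inside the \{...\} interval.
    sed "s/^\(\([^.]*\.\)\{$n\}\).*/\1/"
}

printf '%s\n' 'One. Two. Three. Four.' | first_n_sentences 3
# prints: One. Two. Three.
```

A line with fewer than N sentences does not match and is passed through unchanged.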
