Using sed/awk, I need to remove all lines in a file from the first occurrence of pattern1 up-to (but not including) the last occurrence of pattern2

Question

Using sed/awk, I need to remove all lines in a file from the first occurrence of pattern1 up-to (but not including) the last occurrence of pattern2.

Consider the following text:

    <entity name="good">
    </entity>
    <entity name="bad">
    stuff to delete
    </entity>
    <entity name="bad">
    stuff to remove
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="deleteMe2">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

I would like the following outcome:

    <entity name="good">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

I know how to do a range in sed, but I can't figure out how to match the last occurrence of 'bad2' and not include it in the delete. The command below will of course not work, as it will stop at the first 'bad2' and not remove the 'deleteMe2' or the 2nd occurrence of 'bad2'.

    sed -i '/<entity name="bad"/,/<entity name="bad2"/d' file.xml
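For illustration, the range above can be run without -i against the sample to see where it stops. This is only a sketch: the temporary file and the variable names `sample` and `out` are assumptions made for the demonstration, not part of the original setup.

```shell
#!/bin/sh
# Recreate the sample from the question in a temporary file (path chosen by mktemp).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<entity name="good">
</entity>
<entity name="bad">
stuff to delete
</entity>
<entity name="bad">
stuff to remove
</entity>
<entity name="bad2">
</entity>
<entity name="deleteMe2">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>
EOF

# Same address range as the command above, but without -i, so the result
# goes to stdout and the file itself is left untouched.
out=$(sed '/<entity name="bad"/,/<entity name="bad2"/d' "$sample")
printf '%s\n' "$out"

rm -f "$sample"
```

On this sample the range ends at the first 'bad2' line, so the output should still contain the closing tag of the first 'bad2' entity and the whole 'deleteMe2' entity.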

There can be hundreds of 'bad'/'deleteMe2'/'bad2' lines in the file I am dealing with, so a simple line count won't work. I am fine if this takes multiple commands (it does not have to be a single one), but the more efficient the better, because the file being modified can be quite large. Also, the -i is there because I want to delete the lines in place.

NOTE: I am more familiar with SED than I am with AWK, but I am open to all the help I can get:)

Looks a lot like XML. Is it XML? Because if so, it's almost certainly better to use a parser. — Sobrique
Yes, it is XML, and I totally get not using sed/awk to modify XML, but the XML definition is quite simple in this case. Literally what you see above with some additional text. One constraint I didn't really mention is that I will most likely have to do this on a Windows box, most likely with gnused or gawk. I will consider perl as an option if there is not a way to do what I am asking in sed/awk. — Nick Vallely
When you repeat the title in your question, the question is more clear. I skipped the title first and got confused. — Walter A
Which sections do you need to delete vs keep? It's not clear from your "outcome", which deletes 2 bad sections, 1 bad2 section, and 1 deleteMe2 section. — Brian
@Brian Delete the first line with bad and all following lines until the last bad2 section is completed. Everything in between is bad. — Walter A

Sobrique · Accepted Answer · 2016-02-02T21:50:59

This looks like XML to me, so I would strongly suggest that regex isn't the tool for the job. Use a parser instead:

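    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    # Parse the whole file; it must be well-formed XML with a single root element.
    my $twig = XML::Twig->new->parsefile('your_file.xml');
    # Delete every <entity> element whose name attribute is exactly "bad".
    $_->delete for $twig->findnodes('//entity[@name="bad"]');
    # Print the modified document to STDOUT, re-indented.
    $twig->set_pretty_print('indented_a');
    $twig->print;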

Or perhaps more comprehensively:

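    # Walk every <entity> element and delete those with an unwanted name.
    for my $entity ( $twig->findnodes('//entity') ) {
        my $name = $entity->att('name') // '';    # avoid warnings if the attribute is missing
        if ( $name eq 'bad' or $name eq 'deleteMe2' ) {
            $entity->delete;
        }
    }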

To delete only the first instance of 'bad2', call findnodes once for that name and delete only the first 'hit' it returns.
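If a parser is not available (the comments mention gnused or gawk on Windows), the line-based deletion the question asks for can be sketched as a two-pass awk script. This is a minimal sketch, not a replacement for the parser approach: the helper name `delete_span`, the temporary file and the variable names are assumptions made for illustration, and it assumes each opening tag sits on its own line exactly as in the sample. The first pass records the line number of the first start match and the last stop match; the second pass prints every line outside that span.

```shell
#!/bin/sh
# Hypothetical helper: print file "$1" without the lines from the first line
# containing the literal string "$2" up to, but not including, the last line
# containing the literal string "$3". If either string is missing, or the last
# stop line does not come after the first start line, nothing is deleted.
# Note: awk -v interprets backslash escapes, so strings with backslashes need care.
delete_span() {
    awk -v start="$2" -v stop="$3" '
        NR == FNR {                                   # first pass over the file
            if (!first_start && index($0, start)) first_start = FNR
            if (index($0, stop)) last_stop = FNR      # keeps being overwritten, so ends as the last match
            next
        }
        # second pass: print only the lines outside [first_start, last_stop)
        !(first_start && last_stop > first_start && FNR >= first_start && FNR < last_stop)
    ' "$1" "$1"
}

# Demonstration on the sample from the question (temporary file via mktemp).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<entity name="good">
</entity>
<entity name="bad">
stuff to delete
</entity>
<entity name="bad">
stuff to remove
</entity>
<entity name="bad2">
</entity>
<entity name="deleteMe2">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>
EOF

result=$(delete_span "$sample" '<entity name="bad">' '<entity name="bad2">')
printf '%s\n' "$result"

rm -f "$sample"
```

Because the file is read twice, the script cannot edit it in place directly; one option is to redirect the output to a temporary file and then move that file over the original. The closing `>` in the start string keeps it from also matching the 'bad2' lines.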
