0
votes

Using sed/awk, I need to remove all lines in a file from the first occurrence of pattern1 up to (but not including) the last occurrence of pattern2.

Consider the following text:

    <entity name="good">
    </entity>
    <entity name="bad">
    stuff to delete
    </entity>
    <entity name="bad">
    stuff to remove
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="deleteMe2">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

I would like the following outcome:

    <entity name="good">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

I know how to do a range in sed, but I can't figure out how to match the last occurrence of 'bad2' without including it in the delete. The command below will of course not work, because the range ends at the first 'bad2' and so does not remove 'deleteMe2' or reach the second occurrence of 'bad2'.

    sed -i '/<entity name="bad"/,/<entity name="bad2"/d' file.xml

There can be hundreds of 'bad'/'deleteMe2'/'bad2' lines in the file I am dealing with, so a simple line count won't work. I am fine if this takes multiple commands (it does not have to be a single one), but the more efficient the better, because the file being modified can be quite large. The -i is there because I want to delete the lines in place.

NOTE: I am more familiar with sed than I am with awk, but I am open to all the help I can get :)

4
Looks a lot like XML. Is it XML? Because if so, it's almost certainly better to use a parser. - Sobrique
Yes, it is XML, and I totally get not using sed/awk to modify XML, but the XML definition is quite simple in this case: literally what you see above with some additional text. One constraint I didn't really mention is that I will most likely have to do this on a Windows box, most likely with gnused or gawk. I will consider perl as an option if there is no way to do what I am asking in sed/awk. - Nick Vallely
The question would be clearer if you repeated the title in the body. I skipped the title at first and got confused. - Walter A
Which sections do you need to delete vs keep? It's not clear from your "outcome", which deletes 2 bad sections, 1 bad2 section, and 1 deleteMe2 section. - Brian
@Brian Delete the first line with bad and all following lines until the last bad2 section starts. Everything in between is bad. - Walter A

4 Answers

1
votes

This looks like XML to me, so I would strongly suggest that regex isn't the tool for the job. Use a parser instead:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig -> new -> parsefile ( 'your_file.xml' ) ;
$_ -> delete for $twig -> findnodes ( '//entity[@name="bad"]');
$twig -> set_pretty_print('indented_a');
$twig -> print;

Or perhaps more comprehensively:

for my $entity ( $twig -> findnodes ( '//entity') ) {
   if ( $entity -> att('name') eq "bad"
   or   $entity -> att('name') eq "deleteMe2" ) {
           $entity -> delete; 
   }
}

To delete only the first instance of 'bad2' you can just call findnodes once, and delete the first 'hit'.

1
votes

$ cat tst.awk
NR==FNR {
    if (/"bad"/ && !begFnr) {
        begFnr = FNR
    }
    if (/"bad2"/) {
        endFnr = FNR
    }
    next
}
(FNR < begFnr) || (FNR >= endFnr)

$ awk -f tst.awk file file
<entity name="good">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>

0
votes

awk to the rescue!

$ awk 'NR==FNR && /"bad"/ && !s {s=NR; next}
       NR==FNR && /"bad2"/       {e=NR; next}
       NR!=FNR && (FNR<s || FNR>=e)' xml xml

    <entity name="good">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

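I guess this can be simplified further. It is a two-pass script: the first pass over the file records the line numbers of the first "bad" and the last "bad2", and the second pass prints only the lines outside that range.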
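
Since the question asked for an in-place edit and plain awk has no -i, the two passes could be wrapped roughly as below. This is only a sketch: the function name strip_range and the temp-file handling are my own assumptions (not part of the answer above), the patterns are hard-coded to the sample, and it expects a POSIX shell with mktemp available.

```shell
#!/bin/sh
# strip_range FILE: hypothetical helper for illustration only.
# Removes everything from the first "bad" line up to, but not including,
# the last "bad2" line, then writes the result back into FILE.
strip_range() {
    file=$1
    tmp=$(mktemp) || return 1
    awk '
        # Pass 1 (NR == FNR is true only while reading the first copy):
        # remember the first "bad" line and the last "bad2" line.
        NR == FNR {
            if (/"bad"/ && !first) first = FNR
            if (/"bad2"/)          last  = FNR
            next
        }
        # If either marker is missing or out of order, keep every line.
        !first || !last || last < first { print; next }
        # Pass 2: print only the lines outside [first, last).
        FNR < first || FNR >= last
    ' "$file" "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Copy back instead of mv so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Called as strip_range file.xml, it reads the file twice, so it stays cheap on memory even for large inputs. The guard for missing markers is an addition of mine; without it, a file that has no "bad" line would print nothing at all.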

0
votes

This might work for you (GNU sed):

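    sed '/bad/,$!b;/bad2/h;//!H;$!d;g;/bad2/!d' file

Lines before the first match of bad are printed as normal. From that line to the end of the file, every line is collected in the hold space, and a line matching bad2 overwrites whatever has been collected so far, so the hold space always starts at the most recent bad2. All of these lines except the last are deleted from the output. On the last line, the hold space is copied back into the pattern space and printed, unless it contains no bad2, in which case it is deleted as well.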
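
For anyone who finds the one-liner hard to read, here is the same script with one command per line and a comment on each. It is a sketch under assumptions: the wrapper name trim_with_sed is made up for illustration, and I have only reasoned it through against GNU sed.

```shell
#!/bin/sh
# trim_with_sed FILE: hypothetical wrapper, prints the result to stdout.
trim_with_sed() {
    sed '
        # Before the first line containing bad: print it and move on.
        /bad/,$!b
        # A bad2 line overwrites the hold space ...
        /bad2/h
        # ... any other line is appended (the empty regex reuses bad2).
        //!H
        # Not the last line yet: print nothing.
        $!d
        # Last line: fetch everything collected since the last bad2.
        g
        # No bad2 was ever seen: drop the collected lines.
        /bad2/!d
    ' "$1"
}
```

Two things to keep in mind. The patterns are unanchored, so /bad/ also matches bad2 and any other line that merely contains "bad"; tightening them to /name="bad"/ and /name="bad2"/ would be safer on real data. And if a bad line exists but no bad2 follows it, everything from that bad line to the end of the file is removed.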