I'm trying to modify a regex I'm working on (I'm using Python 3.6) so that it works on my test data. For example:

str = "< @@@@July 2nd 2018 Idustry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electroni@@@@@@c typesetting,> remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum > <@@@@August 1st 2019 dustry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,> remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset she$$$$$$$ets containing Lorem Ipsum passages, and more rece#####ntly with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum <August 2nd 2019 cently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum > <@@@@August 1st 2019 dustry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scramble#######d it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,> remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum >"

The data is a series of fragments delimited by angle brackets. Each fragment I'm interested in begins with an easily identifiable string, in this case @@@@ followed by a date, and ends with a closing angle bracket. So a fragment looks like <@@@@ some date, some text that may itself contain angle brackets >, for example:

< @@@@July 2nd 2018 Idustry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting,> remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum >

The problem is that the text following the date sometimes contains an angle bracket, and since my regex stops at the first > it finds, it matches only part of the fragment. Is there a way to prevent this? I wasn't able to use a negative lookahead successfully.

I've tried the following already:

r"<[(?!<@date) >| (?!<@date) < | ^>]+>

In other words, match anything that doesn't follow a <@date, including angle brackets < or > if they occur in the text, and also match any other character.

 pattern = re.compile(r"<[^>]+>")
 return pattern.findall(str)

The actual result is a partial match: the pattern matches only up to the first > in the text, whereas I'd like to get the entire fragment, including the part after that > up to the actual closing angle bracket and the beginning of the next fragment (unless it's the last fragment, in which case nothing may follow).
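
A minimal reproduction of the partial match, assuming a shortened, hypothetical sample string rather than the full test data (the names `sample` and `partial` are only for illustration):

```python
import re

# Hypothetical shortened sample with a stray ">" inside the first fragment.
sample = (
    "< @@@@July 2nd 2018 text with a stray > inside > "
    "<@@@@August 1st 2019 more text >"
)

# [^>]+ cannot cross a ">", so each match ends at the first ">" it meets.
partial = re.findall(r"<[^>]+>", sample)

for piece in partial:
    print(repr(piece))
```

The first fragment is cut off at the stray >, and the " inside >" part is lost.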

1 Answer

You could match an opening bracket followed by one or more @ characters, and then use a non-greedy match .*? until you encounter either the next <@ or the end of the string:

<\s*@+.*?(?=<@|$)

Your code might look like:

pattern = re.compile(r"<\s*@+.*?(?=<@|$)", re.MULTILINE)
return pattern.findall(str)
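
As a rough, self-contained sketch of how this behaves, here is the same pattern run against a shortened, hypothetical sample (the string and the names `sample` and `fragments` are assumptions for illustration, not the original test data):

```python
import re

# Hypothetical shortened sample; the layout mirrors the question's data.
sample = (
    "< @@@@July 2nd 2018 first text, with a stray > inside > "
    "<@@@@August 1st 2019 second text > trailing words "
    "<@@@@August 3rd 2019 third text >"
)

# "<", optional whitespace, one or more "@", then a lazy match that stops
# right before the next "<@" or at the end of the string/line.
pattern = re.compile(r"<\s*@+.*?(?=<@|$)", re.MULTILINE)

fragments = pattern.findall(sample)
for fragment in fragments:
    print(repr(fragment))
```

Each match runs right up to the next <@, so it keeps any text (and trailing whitespace) that follows the closing >. Note that . does not match a newline and, with re.MULTILINE, $ matches before every newline, so a fragment that spans several lines would be cut at the line end; that case is not handled in this sketch.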

Another option, which I think is closer to what you were attempting, is a tempered greedy token:

<\s*@+(?:(?!<@+).)*>
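
A small sketch of the tempered greedy token in use, again on a shortened, hypothetical sample (the string and the names `sample` and `fragments` are assumptions for illustration):

```python
import re

# Hypothetical shortened sample; the layout mirrors the question's data.
sample = (
    "< @@@@July 2nd 2018 first text, with a stray > inside > "
    "<@@@@August 1st 2019 second text > trailing words "
    "<@@@@August 3rd 2019 third text >"
)

# (?:(?!<@+).)* consumes one character at a time, but only while the
# current position is not the start of a "<@" run. The final ">" then makes
# the engine backtrack to the last ">" before the next fragment begins.
pattern = re.compile(r"<\s*@+(?:(?!<@+).)*>")

fragments = pattern.findall(sample)
for fragment in fragments:
    print(repr(fragment))
```

Because the pattern has to end on a >, each match stops at the last > before the next <@, so text such as "trailing words" after that bracket is not included. If that trailing text should be kept, the lookahead-based pattern is likely the better fit.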
