Regex - nested lookahead assertion

Question

Suppose we want to match every occurrence of one between <out>...</out> in this text (with the "dot matches all" option enabled):

<out>hello!</out>
<nx1>home one</nx1>
<nx2>living</nx2>
<out>one text
text one continues 
and at last here ends one</out>
<m2>dog one</m2>
<out>bye!</out>

Let's say we use this pattern:

one(?=(?:(?!<out>).)*</out>)

I would really appreciate it if someone could explain how the regex engine processes that pattern step by step, and where in the original text it is at each phase of processing (something like @Tim Pietzcker's helpful accepted explanation for this question: Regex - lookahead assertion).

The provided pattern will not match all one(s) between out tags. eg. the third line which contains multiple one(s) — Dhrubajyoti Gogoi

Robin · Accepted Answer · 2014-05-29T09:01:31

Many tools exist to automatically explain what your regex does, character by character.

The idea behind it, though, is that you want to check that one is followed by </out> without entering a new out tag on the way: if a ...</out> follows and no new <out>...</out> structure has started in between, we know we are already inside one.

So the regex matches one if it is followed by </out> and there is no <out> between the two.
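As a rough check of that rule, here is a minimal Python sketch. Python's re module is only my choice for illustration (the question does not name an engine), and inside_out is a hypothetical helper that restates the rule with plain string searches so it can be compared against the pattern on the question's sample text.

```python
import re

# Sample text from the question.
TEXT = """<out>hello!</out>
<nx1>home one</nx1>
<nx2>living</nx2>
<out>one text
text one continues 
and at last here ends one</out>
<m2>dog one</m2>
<out>bye!</out>
"""

# The pattern from the question; re.DOTALL is the "dot matches all" option.
PATTERN = re.compile(r"one(?=(?:(?!<out>).)*</out>)", re.DOTALL)

def inside_out(text, pos):
    """True if </out> follows pos and no <out> starts before it."""
    close = text.find("</out>", pos)
    if close == -1:
        return False
    opening = text.find("<out>", pos)
    return opening == -1 or opening > close

regex_hits = [m.start() for m in PATTERN.finditer(TEXT)]
manual_hits = [m.start() for m in re.finditer("one", TEXT)
               if inside_out(TEXT, m.end())]

# 1-based line number of each regex match.
hit_lines = [TEXT.count("\n", 0, start) + 1 for start in regex_hits]
print(hit_lines)
```

On this sample both approaches pick the same three occurrences, all inside the multi-line <out> block; the one in <nx1> and the one in <m2> are rejected because an <out> shows up before any </out>.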

The work is done by (?:(?!<out>).)*: the . matches a character only if <out> does not start at that position. So we can reach </out> only by consuming characters that are not the < of an <out> tag.
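To make the position-by-position behaviour concrete, here is a small hand-rolled sketch in Python. It is not how a real engine is implemented; tempered_scan and trace are hypothetical names, and SAMPLE is a shortened, made-up string rather than the question's full text. It only imitates what the lookahead does after each one: advance one character at a time until <out> would start, then see whether </out> was passed on the way.

```python
import re

# Shortened, made-up sample (not the question's full text).
SAMPLE = "<out>a one b</out><m2>dog one</m2><out>bye!</out>"

def tempered_scan(text, pos):
    """Mimic (?:(?!<out>).)* : consume one character at a time,
    stopping where <out> begins (or at the end of the text)."""
    while pos < len(text) and not text.startswith("<out>", pos):
        pos += 1
    return pos

def trace(text):
    """For each 'one', return (start, scan_stop, accepted)."""
    steps = []
    for m in re.finditer("one", text):
        stop = tempered_scan(text, m.end())
        # The engine then backtracks from `stop` looking for </out>;
        # the lookahead succeeds if one lies inside the scanned span.
        accepted = text.find("</out>", m.end(), stop) != -1
        steps.append((m.start(), stop, accepted))
    return steps

for start, stop, accepted in trace(SAMPLE):
    print(f"'one' at {start}: scan stops at {stop}, match={accepted}")
```

The first one is accepted because the scan passes a </out> before it is stopped; the second is rejected because the scan is stopped by the next <out> with only </m2> behind it.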

A speed improvement would be:

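one(?=(?:[^<]++|<(?!/?out>))*+</out>)

Stepping into the negative lookahead at every character makes each character much more expensive to match. Here [^<]++ runs straight to the next suspicious <, and the negative lookahead check is performed only there. The check excludes both <out> and </out>, so the possessive loop stops at whichever tag comes first, and the final </out> succeeds only if that tag is the closing one.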
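The same idea can be tried in Python with a few caveats. Possessive quantifiers only arrived in Python's re module in version 3.11, so this sketch uses a non-possessive "unrolled" variant instead; that variant is my own assumption for illustration, not the pattern given above. SAMPLE, starts, and compare are hypothetical names. No timings are claimed here, since they depend on the engine and the input.

```python
import re
import timeit

# Tempered-dot pattern from the question ("dot matches all" via re.DOTALL).
TEMPERED = re.compile(r"one(?=(?:(?!<out>).)*</out>)", re.DOTALL)

# Unrolled variant: skip runs of non-< characters in one step and run
# the negative lookahead only at a '<'. The loop stops at <out> or </out>.
UNROLLED = re.compile(r"one(?=[^<]*(?:<(?!/?out>)[^<]*)*</out>)")

# Made-up sample with an inner tag inside the <out> block.
SAMPLE = "<out>one <b>one</b> end one</out>\n<m2>dog one</m2>\n<out>bye!</out>\n"

def starts(pattern, text):
    """Start offsets of every match of `pattern` in `text`."""
    return [m.start() for m in pattern.finditer(text)]

def compare(text, number=200):
    """Time both patterns on `text`; returns seconds per pattern name."""
    return {
        name: timeit.timeit(lambda p=p: p.findall(text), number=number)
        for name, p in (("tempered", TEMPERED), ("unrolled", UNROLLED))
    }

print(starts(TEMPERED, SAMPLE) == starts(UNROLLED, SAMPLE))
```

Calling compare on a larger input of your own gives a rough idea of whether the rewrite pays off on your data.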
