Does lookaround affect which languages can be matched by regular expressions?

Question

There are some features in modern regex engines which allow you to match languages that couldn't be matched without that feature. For example the following regex using back references matches the language of all strings that consist of a word that repeats itself: (.+)\1. This language is not regular and can't be matched by a regex that does not use back references.
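As a quick illustration of the back-reference example, here is a minimal sketch using Python's re module. The helper name is_square is made up for this example, and it assumes the whole string must be the repeated word.

```python
import re

# Hypothetical helper: True if s is some non-empty word written twice in a row.
def is_square(s):
    return re.fullmatch(r"(.+)\1", s) is not None

print(is_square("abcabc"))  # True
print(is_square("abcabd"))  # False
```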

Does lookaround also affect which languages can be matched by a regular expression? I.e. are there any languages that can be matched using lookaround that couldn't be matched otherwise? If so, is this true for all flavors of lookaround (negative or positive lookahead or lookbehind) or just for some of them?

regular-expressions.info/lookaround.html states that "Lookarounds allow you to create regular expressions that are impossible to create without them, or that would get very longwinded without them". But the only example in that direction is about the impossibility to find and match a q not followed by a u. This says nothing about whether it is possible to tell if the input string contains a q not followed by a u (without having to match only that q). — Christian Semrau
@ChristianSemrau: It might not be a programming question per se, but the requirement is only "programming related" and I think this qualifies. And for me this question is actually interesting from a practical viewpoint as it came up during programming. — sepp2k
@Christian Semrau: My main criterion for "programming related" would be if the question would be at home on a similar accounting site (with obvious simple substitutions). Regexes are pretty strictly a programming thing. I, personally, think it on-topic. — David Thornley
Apparently the question of whether or not CS belongs in stackoverflow has been discussed before: meta.stackexchange.com/questions/26889/… . Personally I hope to see more CS questions here, or perhaps a sister site if necessary. — polygenelubricants

Francis Davey · Accepted Answer · 2010-06-06T22:39:26

The answer to the question you ask, which is whether a larger class of languages than the regular languages can be recognised with regular expressions augmented by lookaround, is no.

A proof is relatively straightforward, but an algorithm to translate a regular expression containing lookarounds into one without is messy.

First, note that you can always negate a regular expression (over a finite alphabet). Given a deterministic finite state automaton that recognises the language generated by the expression, you can simply swap the accepting and non-accepting states to get an FSA that recognises exactly the negation of that language, for which there is a family of equivalent regular expressions.
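A minimal sketch of that state swap, assuming a deterministic machine with a transition for every state and symbol (the swap is not valid on a non-deterministic one). The example machine, which accepts strings over {a, b} ending in "a", and all names are made up for illustration.

```python
# A complete DFA for an assumed example language: strings over {a, b} ending in "a".
STATES = {0, 1}
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
START = 0
ACCEPTING = {1}

def accepts(accepting, s):
    """Run the DFA on s and report whether it stops in one of the given states."""
    state = START
    for ch in s:
        state = DELTA[(state, ch)]
    return state in accepting

# Negation: same states and transitions, accepting and non-accepting swapped.
COMPLEMENT_ACCEPTING = STATES - ACCEPTING

for s in ["", "a", "ba", "ab"]:
    print(repr(s), accepts(ACCEPTING, s), accepts(COMPLEMENT_ACCEPTING, s))
```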

Second, because regular languages (and hence regular expressions) are closed under negation, they are also closed under intersection, since A intersect B = neg(neg(A) union neg(B)) by de Morgan's laws. In other words, given two regular expressions, you can find another regular expression that matches exactly the strings matched by both.
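Closure under intersection can also be seen directly, without going through de Morgan, by running two deterministic machines side by side (the product construction). The sketch below takes that route; the two example machines and all names are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ab"

# Two complete DFAs as (delta, start, accepting); both are made-up examples.
# A: strings containing at least one "a".
A = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}, 0, {1})
# B: strings of even length.
B = ({(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}, 0, {0})

def run(dfa, s):
    """Return True if the DFA accepts s."""
    delta, state, accepting = dfa
    for ch in s:
        state = delta[(state, ch)]
    return state in accepting

def intersect(x, y):
    """Product DFA: its states are pairs, accepting when both components accept."""
    (dx, sx, ax), (dy, sy, ay) = x, y
    states_x = {p for p, _ in dx}
    states_y = {q for q, _ in dy}
    delta = {
        ((p, q), ch): (dx[(p, ch)], dy[(q, ch)])
        for p, q in product(states_x, states_y)
        for ch in ALPHABET
    }
    accepting = {(p, q) for p in ax for q in ay}
    return delta, (sx, sy), accepting

BOTH = intersect(A, B)

# Brute-force check on all strings up to length 3.
for n in range(4):
    for chars in product(ALPHABET, repeat=n):
        s = "".join(chars)
        assert run(BOTH, s) == (run(A, s) and run(B, s))
```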

This allows you to simulate lookaround expressions. For example, u(?=v)w matches only strings that match uw and that also begin with a match for uv.

For negative lookahead you need the regular expression equivalent of the set-theoretic A\B, which is just A intersect (neg B) or equivalently neg(neg(A) union B). Thus for any regular expressions r and s you can find a regular expression r-s which matches those strings that match r but do not match s. In negative lookahead terms: u(?!v)w matches only those strings which match uw - uv.
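One easy instance can be checked by brute force. The sketch below assumes u = a, v = b, w = [abc] over the alphabet {a, b, c}, and compares the lookahead pattern against the set difference uw - uv on all strings up to length 3. The helper name language is made up, and this checks a single example, not the general claim.

```python
import re
from itertools import product

ALPHABET = "abc"

def language(pattern, max_len=3):
    """All strings over ALPHABET up to max_len that the pattern matches in full."""
    rx = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(ALPHABET, repeat=n)
        if rx.fullmatch("".join(chars))
    }

with_lookahead = language(r"a(?!b)[abc]")
# uw - uv for this example: strings matching a[abc], minus those matching ab.
without_lookahead = language(r"a[abc]") - language(r"ab")

print(sorted(with_lookahead))               # ['aa', 'ac']
print(with_lookahead == without_lookahead)  # True
```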

There are three reasons why lookaround is useful.

First, negating a regular expression can result in something much less tidy. For example, q(?!u) = q($|[^u]).

Second, regular expressions do more than match strings: they also consume characters from a string, or at least that's how we like to think about them. For example, in Python I care about .start() and .end(), so the two forms are not interchangeable:

>>> import re
>>> re.search('q($|[^u])', 'Iraq!').end()
5
>>> re.search('q(?!u)', 'Iraq!').end()
4

Third, and I think this is a pretty important reason, negation of regular expressions does not lift nicely over concatenation. neg(a)neg(b) is not the same thing as neg(ab), which means that you cannot translate a lookaround out of the context in which you find it - you have to process the whole string. I guess that makes it unpleasant for people to work with and breaks people's intuitions about regular expressions.
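A small brute-force sketch of why negation does not distribute over concatenation, assuming a and b are the single-character expressions over the alphabet {a, b} and looking only at strings up to length 3. The helper names are made up for illustration.

```python
from itertools import product

ALPHABET = "ab"
MAX_LEN = 3
# Every string over ALPHABET up to MAX_LEN; a bounded stand-in for "all strings".
UNIVERSE = {"".join(c) for n in range(MAX_LEN + 1) for c in product(ALPHABET, repeat=n)}

def neg(lang):
    """Complement of a language within the bounded universe."""
    return UNIVERSE - lang

def concat(x, y):
    """Concatenation of two languages, truncated to the bounded universe."""
    return {s + t for s in x for t in y if len(s + t) <= MAX_LEN}

neg_a_neg_b = concat(neg({"a"}), neg({"b"}))
neg_ab = neg({"ab"})

# "ab" splits as "" + "ab": "" is not "a", and "ab" is not "b".
print("ab" in neg_a_neg_b)  # True
print("ab" in neg_ab)       # False
```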

I hope I have answered your theoretical question (it's late at night, so forgive me if I am unclear). I agree with a commentator who said that this does have practical applications. I met very much the same problem when trying to scrape some very complicated web pages.

EDIT

My apologies for not being clearer: I do not believe you can give a proof of regularity of regular expressions + lookarounds by structural induction. My u(?!v)w example was meant to be just that, an example, and an easy one at that. The reason a structural induction won't work is that lookarounds behave in a non-compositional way, which is the point I was trying to make about negations above. I suspect any direct formal proof is going to have lots of messy details. I have tried to think of an easy way to show it but cannot come up with one off the top of my head.

To illustrate using Josh's first example of ^([^a]|(?=..b))*$ this is equivalent to a 7 state DFSA with all states accepting:

A - (a) -> B - (a) -> C --- (a) --------> D 
Λ          |           \                  |
|          (not a)       \               (b)
|          |              \               | 
|          v                \             v
(b)        E - (a) -> F      \-(not(a)--> G  
|            <- (b) - /                   |
|          |                              |
|         (not a)                         |
|          |                              |
|          v                              |
\--------- H <-------------------(b)-----/

The regular expression for state A alone looks like:

^(a([^a](ab)*[^a]|a(ab|[^a])*b)b)*$

In other words any regular expression you are going to get by eliminating lookarounds will in general be much longer and much messier.

To respond to Josh's comment - yes, I do think the most direct way to prove the equivalence is via the FSA. What makes this messier is that the usual way to construct an FSA is via a non-deterministic machine: it's much easier to express u|v as simply the machine constructed from machines for u and v with an epsilon transition to the two of them. Of course this is equivalent to a deterministic machine, but at the risk of exponential blow-up of states, whereas negation is much easier to do via a deterministic machine.

The general proof will involve taking the cartesian product of two machines and selecting those states you wish to retain at each point you want to insert a lookaround. The example above illustrates what I mean to some extent.

My apologies for not supplying a construction.

FURTHER EDIT: I have found a blog post which describes an algorithm for generating a DFA out of a regular expression augmented with lookarounds. It's neat because the author extends the idea of an NFA-e with "tagged epsilon transitions" in the obvious way, and then explains how to convert such an automaton into a DFA.

I thought something like that would be a way to do it, but I'm pleased that someone has written it up. It was beyond me to come up with something so neat.
