What kind of formal languages can modern regex engines parse?

Question

Here on SO people sometimes say something like "you cannot parse X with regular expressions, because X is not a regular language". From my understanding, however, modern regular expression engines can match more than just regular languages in Chomsky's sense. My question:

given a regular expression engine that supports

backreferences
lookaround assertions of unlimited width
recursion, like (?R)

what kind of languages can it parse? Can it parse any context-free language, and if not, what would be the counterexample?

(To be precise, by "parse" I mean "build a single regular expression that would accept all strings generated by the grammar X and reject all other strings").

Add.: I'm particularly interested to see an example of a context-free language that modern regex engines (Perl, .NET, the Python regex module) would be unable to parse.

The thing with regex is that it can be very precise or very loose, but it is hard to make it behave "just right". This is the case with street HTML, where there are invalid opening or closing tags. — nhahtdh
This may be better off on Computer Science. By the way, regexps are not grammars; they are a different formalism. — Raphael
A recent article on the subject is: The true power of regular expressions - It's an interesting read, and I think it answers your questions with good examples. — Kobi
@Kobi: Bingo! That post is exactly what I was looking for. Can you make your comment an answer so I can accept it? — georg

NikiC · Accepted Answer · 2012-07-08T11:06:51

I recently wrote a rather long article on this topic: The true power of regular expressions.

To summarize:

Regular expressions with support for recursive subpattern references can match all context-free languages (e.g. a^n b^n).
Regular expressions with lookaround assertions and subpattern references can match at least some context-sensitive languages (e.g. ww and a^n b^n c^n).
If the assertions have unlimited width (as you say), then all context-sensitive languages can be matched. However, I don't know of any regex flavor that has no fixed-width restriction on lookbehind and at the same time supports subpattern references.
Matching regular expressions with backreferences is NP-complete, so any other problem in NP can be solved using regular expressions (after applying a polynomial-time transformation).

Some examples:

Matching the context-free language {a^n b^n, n>0}:

/^(a(?1)?b)$/
# or
/^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x

Matching the context-sensitive language {a^n b^n c^n, n>0}:

/^
    (?=(a(?-1)?b)c)
    a+(b(?-1)?c)
$/x
# or
/^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x
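
The ww language mentioned in the summary needs neither recursion nor lookaround: a single backreference is enough. The following is a minimal sketch, assuming Python's standard `re` module and the two-letter alphabet {a, b}; the helper names (`is_ww_regex`, `is_ww_reference`, `disagreements`) are made up for illustration. It checks the pattern against a plain string comparison on every string up to a small length, which is "parse" in the sense the question defines: accept every string of the language and reject all others.

```python
import re
from itertools import product

# Backreference pattern for the copy language {ww : w non-empty}.
# Group 1 captures w; \1 then requires an exact repeat of it.
WW = re.compile(r"(.+)\1", re.DOTALL)


def is_ww_regex(s: str) -> bool:
    """Accept s iff the whole string has the form ww (regex version)."""
    return WW.fullmatch(s) is not None


def is_ww_reference(s: str) -> bool:
    """Accept s iff it splits into two equal, non-empty halves."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and half > 0 and s[:half] == s[half:]


def disagreements(alphabet: str = "ab", max_len: int = 8) -> list[str]:
    """List every string up to max_len on which the two recognizers differ."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if is_ww_regex(s) != is_ww_reference(s):
                bad.append(s)
    return bad


if __name__ == "__main__":
    print(is_ww_regex("abab"))   # w = "ab"
    print(is_ww_regex("abba"))   # a palindrome, but not of the form ww
    print(disagreements())       # empty if the regex agrees with the reference
```

The exhaustive comparison is only a sanity check on short strings, not a proof. The recursive patterns above, such as `(?1)` and `(?-1)`, are not supported by the standard `re` module, so they would need PCRE or the third-party `regex` module instead.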
