Why Are Evil Regexes A Problem?
Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.
Here is a simple pattern inspired by the first example in the OP's post:
^((ab)*)+$
Given the input:
abababababababababababab
The regex engine tries something like (abababababababababababab) and a match is found on the first try.
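This first, successful match is easy to reproduce. Here's a minimal sketch in Python, whose built-in re module uses a backtracking engine like the ones discussed here:

```python
import re

# The pattern from the example: a quantified group nested inside another quantifier
pattern = re.compile(r'^((ab)*)+$')

# A well-formed input matches on the first attempt, no backtracking needed
match = pattern.match("ab" * 12)
print(match is not None)            # True
print(match.group(0) == "ab" * 12)  # True: the inner (ab)* consumed everything
```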
But then we throw the monkey wrench in:
ababababababababababababa
The engine will first try (abababababababababababab), but that fails because of the extra a. This causes catastrophic backtracking: our pattern (ab)*, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:
(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope
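Each line in that trace is one way of splitting the twelve "ab" units into ordered, non-empty runs - a composition of 12, and there are 2^11 = 2,048 of them. A small illustrative sketch in Python counts the splits by brute force:

```python
def count_splits(n):
    """Count the ways to split n "ab" units into ordered, non-empty runs.

    Each run is one capture of the inner (ab)* group, so this is the
    number of distinct groupings the backtracking engine can try.
    """
    if n == 0:
        return 1  # exactly one way to split nothing
    # The first run takes 1..n units; the remainder is split recursively
    return sum(count_splits(n - first) for first in range(1, n + 1))

print(count_splits(12))  # 2048, i.e. 2**11 groupings before giving up
```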
The number of possible combinations scales exponentially with the length of the input, and before you know it the regex engine is eating up all your system resources trying to solve this thing, until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.
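You can watch that blow-up directly. The sketch below (illustrative, not a rigorous benchmark) times the failing match in Python for growing inputs:

```python
import re
import time

pattern = re.compile(r'^((ab)*)+$')

for n in (16, 18, 20, 22):
    evil_input = "ab" * n + "a"  # the trailing "a" forces every grouping to fail
    start = time.perf_counter()
    result = pattern.match(evil_input)
    elapsed = time.perf_counter() - start
    assert result is None  # there really is no match
    print(f"n={n:2d}: {elapsed:.4f}s")
```

Each step of two extra "ab" pairs multiplies the number of groupings to try by four, so pushing n even modestly higher makes a single match attempt take minutes.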
How to Spot Evil Regexes
It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem, which Alan Turing proved was impossible to solve. I have written problematic regexes myself, even though I know what they are and generally how to avoid them. Wrapping everything you can in an atomic group can help to prevent the backtracking issue: it basically tells the regex engine not to revisit a given expression - "lock in whatever you matched on the first try". Note, however, that atomic groups don't prevent backtracking within the expression, so ^(?>((ab)*)+)$ is still dangerous, but ^(?>(ab)*)+$ is safe (it'll match (abababababababababababab) and then refuse to give up any of its matched characters, thus preventing catastrophic backtracking).
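If your engine has no atomic groups (Python's built-in re module, for instance, only added them in version 3.11), this particular pattern has an even simpler fix: the outer quantifier is redundant, and the unambiguous ^(ab)*$ matches exactly the same strings in linear time. A sketch:

```python
import re

# Same language as ^((ab)*)+$, but with no nested quantifiers: for any
# input there is only one way to consume the "ab" pairs, so a failing
# match is rejected after a linear scan instead of an exponential search.
safe = re.compile(r'^(ab)*$')

malicious = "ab" * 5000 + "a"
print(safe.match(malicious) is None)        # True, and it returns instantly
print(safe.match("ab" * 5000) is not None)  # True: valid input still matches
```

More generally, the ambiguity to look for is a quantified expression that can match the same text in more than one way; removing that ambiguity removes the exponential search space.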
Unfortunately, once it's written, it's very hard to spot a problem regex quickly. In the end, recognizing a bad regex is like recognizing any other bad code: it takes a lot of time and experience and/or a single catastrophic event.
Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.
Static Detection of DoS Vulnerabilities in
Programs that use Regular Expressions
Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig
The University of Texas at Austin