8
votes

We all know that you can compile your much-used regular expressions into something that performs very good. But what is this witchcraft happening behind the curtains?

I guess that a finite state automaton gets built there, but you must know better than me.

2
"a finite state automaton gets built there". Correct. What more do you want to know? It really is just that simple.S.Lott
@S.Lott: well, I thought building them was complicated. You could post for instance well known algorithms or something. What would be the modus operandi when someone implements a RE lib?mike3996
Taking capture groups and extended RE capabilities into account, the code generated to implement a regular expression is probably going to be a bit more complex than a finite state automaton.Jim Lewis
Well-known algorithms? Which RE library would you like us to look at for you? You do have the same access to the source that the rest of us do. Further, many advanced CS sites have tutorials on RE operations. I'm unclear what you want to know that you don't already know. You've described it very completely.S.Lott
Moreover, getting into more detail requires abandoning the language-agnostic tag.David Thornley

2 Answers

6
votes

The details of regular expression compilation vary by implementation. For example, compilation in Python or re2 simply creates an instance of a regular expression object. This object's state machine may be modeled as a graph or virtual machine. Without compilation (example: RE.match(expression, input)), a new regular expression object is created behind the scenes each time match is called. This is needless work if you're going to use an expression more than once.

In C#, one of three things can happen when you compile:

  1. A regular expression object is created (implemented as a virtual machine) similar to Python and re2.
  2. A regular expression object is created and its virtual machine opcodes are compiled to in-memory IL instructions on-the-fly.
  3. A regular expression object is created and its virtual machine opcodes are compiled to disk as IL instructions.

You mention an interest in algorithms. Take a look at Russ Cox's excellent articles for two approaches:

1
votes

Compiling a regular expression is similar to compiling Java or Python code; the regular expression is converted to an intermediate representation that the RE engine then interprets in order to perform the appropriate operations upon a string.