Is syntax-highlighting programming languages using regular expressions possible?

Question

We all know by now that parsing HTML using regular expressions is not possible in general, since HTML's nested structure is not regular (it needs at least a context-free grammar), while regular expressions can only recognize regular languages. The same is certainly true for programming languages.

Recently, the Rainbow.js syntax highlighter was announced. Its premise is described as very simple:

Rainbow on its own is very simple. It goes through code blocks, processes regex patterns, and wraps matching patterns in tags.

I figured syntax highlighting is essentially a task of the same complexity as language parsing, if we assume it has to be both good and suitable for many languages. Still, while there is quite a bit of criticism of that library, neither that criticism nor the HackerNews discussion (taken as an example of a discussion among technically inclined people) mentions that highlighting syntax using regular expressions is basically impossible in the general case, which I'd consider a major, show-stopping flaw.

Now the question is: is there something I'm missing? In particular:

Is syntax highlighting with regular expressions possible in general?
Is this an instance of an applied 80/20 rule, where just enough is possible with regular expressions to be useful?

A.H. · Accepted Answer · 2012-03-31T13:01:08

Syntax highlighting using regexp is an old art. I think even Emacs and vi started this way.

I figured syntax highlighting is essentially a task of the same complexity as language parsing,[...]

No. The difference is: the compiler needs real parsing because it needs to understand the complete program and also needs to generate stuff from that understanding. Syntax highlighting, on the other hand, does not need to understand the code. It merely needs to understand the general structure of the language: which parts are string literals, which are keywords, and so on. A side effect of this difference is: you can highlight code which is syntactically incorrect, but you cannot parse it.
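To make that side effect concrete, here is a small sketch in Python. The snippet `broken` and the helper names `can_parse` and `string_literals` are made up for illustration, and the string regex is a simplification that ignores triple-quoted strings. The real parser rejects the unfinished line, while the regex still finds the literal that a highlighter would color.

```python
import ast
import re

# Simplified pattern for single-line string literals (assumption: no
# triple-quoted strings, no prefixes such as r"" or f"").
STRING = re.compile(r'"(?:\\.|[^"\\\n])*"|\'(?:\\.|[^\'\\\n])*\'')

# Hypothetical example input: the call is never closed.
broken = 'print("hello" +'

def can_parse(source):
    """Return True if Python's own parser accepts the source."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def string_literals(source):
    """Return the text of every string literal the regex can see."""
    return STRING.findall(source)
```

Here `can_parse(broken)` is false, yet `string_literals(broken)` still returns the one literal, which is all a highlighter needs in order to color it.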

A slightly different approach to this: parsing a language is often a two-step process: lexing (splitting up the byte stream into a "token" stream) and real parsing (bringing the token stream into some complex structure, often an Abstract Syntax Tree). Lexing is usually done using regular expressions. See the flex docs for this. And that's basically all a basic syntax highlighter needs to understand.
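A rough sketch of such a lexer-style highlighter, in Python. This is not how Rainbow.js or flex are implemented; the token classes, the keyword list and the `highlight` function are assumptions chosen for a small C-like language. Each token class is one regular expression, and the first alternative that matches at a position wins.

```python
import re
from html import escape

# Hypothetical token classes; order matters because the first
# alternative that matches at the current position wins.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*"),
    ("string",  r'"(?:\\.|[^"\\\n])*"'),
    ("number",  r"\b\d+(?:\.\d+)?\b"),
    ("keyword", r"\b(?:if|else|while|for|return|int|void)\b"),
    ("ident",   r"[A-Za-z_]\w*"),
    ("other",   r"\s+|."),  # whitespace or any single leftover character
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

def highlight(source):
    """Wrap comments, strings, numbers and keywords in <span> tags."""
    out = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        text = escape(match.group())  # HTML-escape the token text
        if kind in ("ident", "other"):
            out.append(text)  # left uncolored
        else:
            out.append(f'<span class="{kind}">{text}</span>')
    return "".join(out)
```

For example, `highlight('return 42;')` wraps `return` in a `keyword` span and `42` in a `number` span, and leaves the space and the semicolon alone. Nothing here tracks nesting or statement structure, which is why it keeps working on incomplete input.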

Of course there are corner cases which regexp alone cannot catch. A typical example is:

foo(bla, bar);

Here foo might be a call to a static method or an instance method or a macro or something else. But your regexp highlighter cannot deduce this. It can only add colors for a "general call".
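As a sketch of what that "general call" rule might look like (Python again; the pattern, the `CONTROL_KEYWORDS` set and `call_names` are illustrative assumptions, not taken from any particular highlighter): an identifier followed by an opening parenthesis is treated as a call, with control-flow keywords filtered out. Nothing in the pattern can tell a method from a macro.

```python
import re

# An identifier directly followed by "(" (optional whitespace between).
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*(?=\()")

# Keywords that also precede "(" but are not calls (assumed list).
CONTROL_KEYWORDS = {"if", "while", "for", "switch", "return"}

def call_names(source):
    """Return names that look like calls, in order of appearance."""
    return [name for name in CALL.findall(source)
            if name not in CONTROL_KEYWORDS]
```

`call_names("foo(bla, bar);")` yields just `foo`, and it would yield the same for a macro invocation or a static method with that spelling.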

So: this is a 100/0 rule if your requirements are low-level (i.e. they exclude distinctions like the one in the example above) and typically a 90/10 rule for real-world code.
