13
votes

We all know by now that parsing HTML using regular expressions is not possible in general, since it'd be parsing a context-sensitive grammar while regular expressions can only parse regular grammars. The same is certainly true for other programming languages.

Now, recently, Rainbow.js syntax highlighter has been announced. Its premise is described as very simple:

Rainbow on its own is very simple. It goes through code blocks, processes regex patterns, and wraps matching patterns in tags.

I figured syntax highlighting is essentially a task of the same complexity as language parsing, if we assume it has to be both good and suitable for many languages. Still, while there is quite a bit of criticism of that library, neither that nor the HackerNews discussion (taken as an example for a discussion by technically-inclined) have mentioned that highlighting syntax using regular expressions is basically impossible in a general case, which I'd consider a major, show-stopping flaw.

Now the question is: is there something I'm missing? In particular:

  1. Is syntax highlighting with regular expressions possible in general?
  2. Is this an instance of an applied 80/20 rule, where just enough is possible with regular expressions to be useful?
4

4 Answers

15
votes

Syntax highlighting using regexp is an old art. I think even Emacs and vi started this way.

I figured syntax highlighting is essentially a task of the same complexity as language parsing,[...]

No. The difference is: the compiler needs real parsing because it needs to understand the complete program and also needs to generate stuff from that understanding. Syntax highlighting, on the other hand, does not need to understand the code. It merely needs to understand the general structure of the language - what are string literals - what are keywords ... and so on. A side effect of this difference is: you can highlight code which is syntactically incorrect, but you cannot parse it.

A slightly different approach to this: Parsing a language is often a two-step process: lexing (splitting up the byte stream into a "token" stream) and real parsing (bring the token stream into some complex structure - often an Abstract Syntax Tree). Lexing is usually done using ---- regular expressions. See the flex docs for this. And that's basically all a basic syntax highlighter needs to understand.

Of course there are corner cases which regexp alone cannot catch. A typical example is:

foo(bla, bar);

Here foo might be a call to a static method or an instance method or a macro or something else. But your regexp highlighter cannot deduce this. It can only add colors for a "general call".

So: This is a 100/0 percent rule if your requirements are low-level (i.e. without the above example) and typically a 90/10 rule for real world stuff.

3
votes

You can do syntax highlighting using regular expressions as part of the solution. More specifically, as part of the "lexer" that chunks the input text into symbols. This is actually the way most compilers/interpreters work.

To do it using regex alone, though, is asking for trouble. Consider the case of matching a string in Python. Python allows strings to be delimited by either single quotes ' or double quotes ". Additionally, it allows multi-line strings ("heredoc syntax") using triple-quotes, ''' or """.

So which parts of the following are strings, and which aren't? Can you construct a regular expression that correctly identifies the string literals str1-str6?

str1 = "hello, world!"

str2 = 'hello, world!'

str3 = "The canonical test program is 'Hello World'."

str4 = '"Why," Peter said, "That\'s ludicrous. Who would do that?"'

str5 = """The heredoc syntax is handy for cases where you don't want to escape strings. "Very convenient."
"""

str6 = """Code sample:
s1 = "hi!"
s2 = 'Hi!'
S3 = '''
- apples
- oranges
- bananas
'''
"""

The argument that "you can't (parse HTML|process programs) with regex because (HTML|programming languages) have nested structures - they aren't regular" isn't quite true - modern regular expressions (particularly in Perl) have more expressive power than strictly regular expressions in the computer-science sense. But just because you can use regular expressions doesn't mean you should.


Edit: the string-matching problem above isn't too bad if your regex flavour supports backreferences in the search pattern. A multi-line regex like ('|"|'''|""").+?\1 would probably do.


Edit 2: For an example of the corner cases in syntax hilighting, look no further than StackOverflow's syntax highlighting of the code above.

2
votes

Basically, no.

You need a parser/tokenizer that understands the language to pick out which bits to highlight.

Regex doesn't cut the mustard for such a task.

0
votes

A good example to look at is the syntax highlighting implementation in Vim. It uses patterns which are regular-expresion based. However, the patterns are used to recognize hierarchical containment structures in the document, and not simply to tokenize the input.

You can declare regions which begin and end on a regex pattern match (plus another pattern that helps skip the middle material). These regions can declare that they contain other regions or simple patterns. Containment can be recursive. Vim works all of that out. So it is essentially a form of context-free parsing.

This approach can handle languages which have various levels of embedding, with different lexical properties.

For example, I have a language in which there are essentially two sets of keywords (due to a domain language embedding that is going on). The Vim syntax highlighting rules that I wrote recognize the context properly and colorize the keywords differently. Note that there is some overlap between these sets of keywords: same word, different meaning in a different context.

For an example of this see: http://www.kylheku.com/cgit/txr/tree/genman.txr. If you search for the syntax (do you will find that one instance is colored purple, and another green. They are different: one is in a text extraction language, and another in an embedded Lisp dialect. Vim's syntax highlighting is powerful enough to handle a mixture of languages with different sets of keywords. (Yes, though this is served over the web, it's actually a Vim process doing the syntax highlighting.)

Or consider something like the shell, where you can have a string literal type syntax, like "foo bar", but inside there, you can have a command substitution, inside which you have to recursively recognize and colorize shell syntax: "foo $(for x in *; do ...; done) bar".

So no, you can't do useful, accurate syntax higlighting just with regex tokenizing, but regexes with hierarchical parsing can do a good job.