There are so many programming languages which support the inclusion of mini-languages. PHP is embedded within HTML. XML can be embedded within JavaScript. Linq can be embedded within C#. Regular expressions can be embedded in Perl.
// JavaScript example
var a = <node><child/></node>
Come to think of it, most programming languages can be modeled as different mini-languages. Java, for example, could be broken into a least four distinct mini-languages:
- A type-declaration langauge (package directive, import directives, class declaration)
- A member-declaration language (access modifiers, method declarations, member vars)
- A statement language (control flow, sequential execution)
- An expression language (literals, assignments, comparisons, arithmetic)
Being able to implement those four conceptual languages as four distinct grammars would certainly cut down on a lot of the spaghettiism that I usually see in complex parser and compiler implementations.
I've implemented parsers for various different kinds of languages before (using ANTLR, JavaCC, and custom recursive-descent parsers), and when the language gets really big and complex, you usually end up with one huuuuuuge grammar, and the parser implementation gets really ugly really fast.
Ideally, when writing a parser for one of those languages, it'd be nice to implement it as a collection of composable parsers, passing control back and forth between them.
The tricky thing is that often, the containing langauge (e.g., Perl) defines its own terminus sentinel for the contained language (e.g., regular expressions). Here's a good example:
my $result ~= m|abc.*xyz|i;
In this code, the main perl code defines a nonstandard terminus "|" for the regular expression. Implementing the regex parser as completely distinct from the perl parser would be really hard, because the regex parser wouldn't know how to find the expression terminus without consulting the parent parser.
Or, lets say I had a language which allowed the inclusion of Linq expressions, but instead of terminating with a semicolon (as C# does), I wanted to mandate the Linq expressions appear within square brackets:
var linq_expression = [from n in numbers where n < 5 select n]
If I defined the Linq grammar within the parent language grammar, I could easily write an unambiguous production for a "LinqExpression" using syntactic lookahead to find the bracket enclosures. But then my parent grammar would have to absorb the whole Linq specification. And that's a drag. On the other hand, a separate child Linq parser would have a very difficult time figuring out where to stop, because it would need to implement lookahead for foreign token-types.
And that would pretty much rule out using separate lexing/parsing phases, since the Linq parser would define a whole different set of tokenization rules than the parent parser. If you're scanning for one token at a time, how do you know when to pass control back to the lexical analyzer of the parent language?
What do you guys think? What are the best techniques available today for implementing distinct, decoupled, and composable language grammars for the inclusion of mini-languages within larger parent langauges?