I am trying to understand how one can define a PHP-like grammar. In PHP, one can get out of PHP mode into HTML mode and then back into PHP mode.
For the sake of asking this question, I am defining my PHP-like language to be ridiculously simple. This language will be referred to as "PHP-like" in the remainder of this question below.
It only contains a single construct: if (expression) { block_list }
, i.e.
the if-statement. A block_list is a sequence of nested if statements,
expressions or HTMLs. Again, to keep the language ridiculously simple, an
expression must be an identifier.
Here is an example, that shows a valid code in this language. Here HTML is followed by two nested if statements which is followed by another HTML.
<body><p>Some HTML text here</p>
<?
if (expression1) {
if (expression2) {
expression3
}
}
?>
</p>Some more HTML text here</p></body>
Here is another example, that shows how we can get out of PHP-like mode into HTML mode within an if statement.
<? if (expression1) { ?>
some html here
<? if (expession2) { ?>
some html here
<? }
}
?>
To implement this, I have a lexer that can identify the following tokens.
HTML = All characters from the beginning of the code or the last
occurrence of "?>" to the end of the code or the next
occurrence of "<?". Zero length string is allowed.
IDENTIFIER = [_a-zA-Z][_a-zA-Z0-9]* i.e. C identifier, a sequence
of underscores, letters and
digits such that the first
character is not a digit.
WHITESPACE = [ \t\r\n]+ i.e. a sequence of spaces, tabs
and newlines.
BEGIN = "<?"
END = "?>"
IF = "if"
LPAREN = "("
RPAREN = ")"
LBRACE = "{"
RBRACE = "}"
The lexer outputs each block of HTML (i.e. the stuff outside PHP-like mode) as a token, i.e. an entire HTML block is a single token. It does not output WHITESPACE. It does not output the opening <?
and closing ?>
in each PHP-like mode, i.e it does not output the first occurrence of BEGIN and the next occurrence of END. As soon as END is reached, anything following it is parsed as HTML again until the next occurrence of BEGIN.
Therefore, for the second code example in this question, the lexer outputs this.
Code:
<? if (expression1) { ?>
some html here
<? if (expession2) { ?>
some html here
<? }
}
?>
Lexer output:
HTML ""
IF "if"
LPAREN "("
IDENTIFIER "expression1"
RPAREN ")"
LBRACE "{"
HTML "\n some html here\n"
IF "if"
LPAREN "("
...
Not outputting BEGIN and END tokens keeps the parser grammar simple. Now I can parse these tokens using the following grammar. Since the parser does not have to deal with BEGIN and END tokens, it does not have to mention them anywhere in the grammar. It keeps the grammar simple.
block_list = block | block_list block;
block = HTML | if_statement | expression;
if_statement = IF LPAREN expression RPAREN LBRACE block_list RBRACE;
expression = IDENTIFIER;
However, say, I want to output the BEGIN and END tokens in the lexer. Is there a good way to write a grammar for it such that it takes care of nested if statements that may also contain HTML within them?
I am trying to handle the presence of BEGIN and END tokens in the lexer output in the following grammar but I am unable to come up with a grammar that works.
block_list = block | block_list block;
block = HTML | php_like | code;
php_like = BEGIN code | BEGIN code END;
code = if_statement | expression;
if_statement = IF LPAREN expression RPAREN LBRACE block_list RBRACE |
IF LPAREN expression RPAREN LBRACE END block_list RBRACE |
IF LPAREN expression RPAREN LBRACE END block_list BEGIN RBRACE
expression = IDENTIFIER;
The above grammar allows both the above code examples in this question. But it also allows the following invalid code.
<?
if (expression1) {
<? expression2
}
?>
I have two questions.
- If the lexer outputs the BEGIN and END tokens, how can I write the grammar to handle them?
- Is it best not to output the BEGIN and END tokens so that the grammar remains simple?