Grammar for parsing PHP like language such that it can handle the PHP begin and end tokens (“<?” and “?>”) in the grammar

Question

I am trying to understand how one can define a PHP-like grammar. In PHP, one can get out of PHP mode into HTML mode and then back into PHP mode.

For the sake of asking this question, I am defining my PHP-like language to be ridiculously simple. This language will be referred to as "PHP-like" in the remainder of this question below.

It only contains a single construct: if (expression) { block_list }, i.e. the if-statement. A block_list is a sequence of nested if statements, expressions or HTMLs. Again, to keep the language ridiculously simple, an expression must be an identifier.

Here is an example, that shows a valid code in this language. Here HTML is followed by two nested if statements which is followed by another HTML.

<body><p>Some HTML text here</p>
<?
    if (expression1) {
        if (expression2) {
            expression3
        }
    }
?>
</p>Some more HTML text here</p></body>

Here is another example, that shows how we can get out of PHP-like mode into HTML mode within an if statement.

<?  if (expression1) {     ?>
        some html here
<?      if (expession2) { ?>
            some html here
<?      } 
    }
?>

To implement this, I have a lexer that can identify the following tokens.

HTML       = All characters from the beginning of the code or the last
             occurrence of "?>" to the end of the code or the next
             occurrence of "<?". Zero length string is allowed.

IDENTIFIER = [_a-zA-Z][_a-zA-Z0-9]*  i.e. C identifier, a sequence
                                     of underscores, letters and
                                     digits such that the first
                                     character is not a digit.

WHITESPACE = [ \t\r\n]+              i.e. a sequence of spaces, tabs
                                     and newlines.

BEGIN      = "<?"

END        = "?>"

IF         = "if"

LPAREN     = "("

RPAREN     = ")"

LBRACE     = "{"

RBRACE     = "}"

The lexer outputs each block of HTML (i.e. the stuff outside PHP-like mode) as a token, i.e. an entire HTML block is a single token. It does not output WHITESPACE. It does not output the opening <? and closing ?> in each PHP-like mode, i.e it does not output the first occurrence of BEGIN and the next occurrence of END. As soon as END is reached, anything following it is parsed as HTML again until the next occurrence of BEGIN.

Therefore, for the second code example in this question, the lexer outputs this.

Code:

<?  if (expression1) {     ?>
        some html here
<?      if (expession2) { ?>
            some html here
<?      } 
    }
?>

Lexer output:

HTML        ""
IF          "if" 
LPAREN      "("
IDENTIFIER  "expression1" 
RPAREN      ")"
LBRACE      "{"
HTML        "\n        some html here\n"
IF          "if"
LPAREN      "("
...

Not outputting BEGIN and END tokens keeps the parser grammar simple. Now I can parse these tokens using the following grammar. Since the parser does not have to deal with BEGIN and END tokens, it does not have to mention them anywhere in the grammar. It keeps the grammar simple.

block_list   = block | block_list block;
block        = HTML | if_statement | expression;
if_statement = IF LPAREN expression RPAREN LBRACE block_list RBRACE;
expression   = IDENTIFIER;

However, say, I want to output the BEGIN and END tokens in the lexer. Is there a good way to write a grammar for it such that it takes care of nested if statements that may also contain HTML within them?

I am trying to handle the presence of BEGIN and END tokens in the lexer output in the following grammar but I am unable to come up with a grammar that works.

block_list   = block | block_list block;
block        = HTML | php_like | code;
php_like     = BEGIN code | BEGIN code END;
code         = if_statement | expression;
if_statement = IF LPAREN expression RPAREN LBRACE block_list RBRACE |
               IF LPAREN expression RPAREN LBRACE END block_list RBRACE |
               IF LPAREN expression RPAREN LBRACE END block_list BEGIN RBRACE
expression   = IDENTIFIER;

The above grammar allows both the above code examples in this question. But it also allows the following invalid code.

<?
    if (expression1) {
        <? expression2
    }
?>

I have two questions.

If the lexer outputs the BEGIN and END tokens, how can I write the grammar to handle them?
Is it best not to output the BEGIN and END tokens so that the grammar remains simple?

rici rici · Accepted Answer · 2015-05-27T02:42:21

Assuming that your lexer continues to be stateful, so that a single HTML token will be emitted for the text between END and BEGIN, there is little difference in the grammar.

Aside from the first and last HTML token, every other HTML token will be preceded by END and followed by BEGIN. In other words, we have:

html: END HTML BEGIN;

The slight complication is that we need to deal with the first and last HTML tokens, which means we need a new non-terminal (which will be the start symbol):

program: HTML BEGIN block_list END HTML;

The rest of the grammar is identical to the original, except that HTML becomes html:

block_list   = block | block_list block;
block        = html /* Change is here */ | if_statement | expression;
if_statement = IF LPAREN expression RPAREN LBRACE block_list RBRACE;
expression   = IDENTIFIER;

If your new lexer no longer emits HTML tokens in the case that the associated text is the empty string, then a couple of alternative rules are needed:

program: leading_html block_list trailing_html;
leading_html: HTML BEGIN | BEGIN;
trailing_html: END HTML | END;
html: END HTML BEGIN | END BEGIN;
 /* Remainder as above */

Grammar for parsing PHP like language such that it can handle the PHP begin and end tokens (“<?” and “?>”) in the grammar

1 Answers