0
votes

I am having some difficulties understanding the specific difference between Lexical Grammar and Syntactic Grammar in the ECMAScript 2017 specification.


Excerpts from ECMAScript 2017

5.1.2 The Lexical and RegExp Grammars

A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.

5.1.4 The Syntactic Grammar

When a stream of code points is to be parsed as an ECMAScript Script or Module, it is first converted to a stream of input elements by repeated application of the lexical grammar; this stream of input elements is then parsed by a single application of the syntactic grammar.


Questions

  1. Lexical grammar
    • Here it says the terminal symbols are Unicode code points (individual characters)
    • It also says it produces input elements (aka. tokens)
    • How are these reconcilable? Either the terminal symbols are tokens, and thus it produces tokens. Or, the terminal symbols are individual code points, and that's what it produces.
  2. Syntactic grammar
    • I have the same questions on this grammar as on the lexical grammar
    • It seems to say that the terminal symbols here are tokens
    • So by applying the syntactic grammar rules, valid tokens are produced, which in turn can be sent to the parser? Or does this grammar accept tokens as input and then test the overall stream of tokens for validity?

My Best Guess

  1. Lexing phase
    • Input: Code points (source code)
    • Output: Applies lexical grammar productions to produce valid tokens (lexeme type + value) as output
  2. Parsing phase
    • Input: Tokens
    • Output: Applies syntactic grammar productions (CFG) to decide if all the tokens together represent a valid stream (i.e. that the source code as a whole is a valid Script / Module)
2
When it says "... has as its terminal symbols Unicode code points...", I think they meant to convey "groupings" of one or more code points as described by the rest of the paragraph. It is a little confusing the way it's written. – user2437417
Crazy Train: nope, each terminal symbol is a single Unicode code point. – Michael Dyck
@Magnus you really ought to accept Bergi's answer. It's spot on, and this question is just sitting out here without the right answer being accepted. – Inigo

2 Answers

3
votes

I think you are confused about what "terminal symbol" means. Terminal symbols are the inputs of the parser, not its output (which is a parse tree, including the degenerate case of a list).

On the other hand, a production rule does indeed have terminal symbols as its output and a goal symbol as its input. It runs backwards, and that is where the term "terminal" comes from. A non-terminal can be expanded (in different ways; that is what the rules describe) into a sequence of terminal symbols.

Example:

Language:
   S -> T | S '_' T
   T -> D | T D
   D -> '0' | '1' | '2' | … | '9'

String:
   12_45

Production:
     S          // start: the goal
   = S '_' T
   = T '_' T
   = T D '_' T
   = T '2_' T
   = D '2_' T
   = '12_' T
   = '12_' T D
   = '12_' T '5'
   = '12_' D '5'
   = '12_45'     // end: the terminals

Parse tree:
   S
    S
     T
      T
       D
        '1'
      D
       '2'
    '_'
    T
     T
      D
       '4'
     D
      '5'

Parser output (generating a sequence of items from top-level Ts):
   '12'
   '45'
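
To make the example concrete, here is a minimal Python sketch of a parser for the toy grammar above. It is only an illustration: the function names (`parse_S`, `parse_T`, `leaves`, `top_level_items`) and the tuple-based tree representation are my own assumptions. Because the grammar is left-recursive, the sketch uses loops instead of direct recursion, but it builds the same left-nested tree as shown above.

```python
DIGITS = "0123456789"  # the terminals that D can expand to


def parse_T(s, i):
    """T -> D | T D: one or more digits, built as a left-nested tree."""
    if i >= len(s) or s[i] not in DIGITS:
        raise SyntaxError(f"expected a digit at position {i}")
    tree = ("T", ("D", s[i]))
    i += 1
    while i < len(s) and s[i] in DIGITS:
        tree = ("T", tree, ("D", s[i]))
        i += 1
    return tree, i


def parse_S(s):
    """S -> T | S '_' T: Ts separated by '_', built as a left-nested tree."""
    t, i = parse_T(s, 0)
    tree = ("S", t)
    while i < len(s) and s[i] == "_":
        t, i = parse_T(s, i + 1)
        tree = ("S", tree, "_", t)
    if i != len(s):
        raise SyntaxError(f"unexpected {s[i]!r} at position {i}")
    return tree


def leaves(node):
    """Concatenate the terminal symbols found under a node."""
    if isinstance(node, str):
        return node
    return "".join(leaves(child) for child in node[1:])


def top_level_items(tree):
    """Collect the text of every T that is a direct child of an S."""
    if len(tree) == 2:  # S -> T
        return [leaves(tree[1])]
    return top_level_items(tree[1]) + [leaves(tree[3])]  # S -> S '_' T


tree = parse_S("12_45")
print(top_level_items(tree))  # ['12', '45']
```

The input of `parse_S` is a string of terminals (single characters); its output is a tree, and the list of items is derived from that tree afterwards.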

So

  • The lexing phase has code points as inputs and tokens as outputs. The code points are the terminal symbols of the lexical grammar.
  • The syntactic phase has tokens as inputs and programs as outputs. The tokens are the terminal symbols of the syntactic grammar.
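
The two phases can be sketched as a small pipeline. This is a rough Python illustration for a made-up mini language (sums of numbers), not ECMAScript itself; the names `lex`, `tokens_of` and `parse_sum`, the element kinds, and the `Sum` grammar are all assumptions chosen for the example.

```python
import re

# Lexical level: the terminals are single characters,
# the output is a list of input elements (kind, text).
INPUT_ELEMENT = re.compile(
    r"(?P<WhiteSpace>[ \t]+)|(?P<Number>[0-9]+)|(?P<Punctuator>[+])"
)


def lex(source):
    elements = []
    pos = 0
    while pos < len(source):
        m = INPUT_ELEMENT.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        elements.append((m.lastgroup, m.group()))
        pos = m.end()
    return elements


def tokens_of(elements):
    """Drop white space; what remains are the tokens."""
    return [e for e in elements if e[0] != "WhiteSpace"]


# Syntactic level: the terminals are whole tokens, the output is a tree.
# Toy grammar:  Sum -> Number | Sum '+' Number
def parse_sum(tokens):
    if not tokens or tokens[0][0] != "Number":
        raise SyntaxError("expected a Number token")
    tree = ("Sum", tokens[0])
    i = 1
    while i < len(tokens):
        if tokens[i] != ("Punctuator", "+"):
            raise SyntaxError(f"expected '+' but found {tokens[i][1]!r}")
        if i + 1 >= len(tokens) or tokens[i + 1][0] != "Number":
            raise SyntaxError("expected a Number token after '+'")
        tree = ("Sum", tree, tokens[i], tokens[i + 1])
        i += 2
    return tree


tree = parse_sum(tokens_of(lex("1 + 23")))
```

In this sketch, `"1 $ 2"` is rejected by `lex` (no lexical production matches `$`), while `"1 23"` lexes fine into two `Number` tokens and is only rejected by `parse_sum`.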
1
votes

Your "best guess" is correct to a first approximation. The main correction is to change "tokens" to "input elements". That is, the lexical level produces input elements (only some of which are designated 'tokens'), and the syntactic level takes input elements as input.

The syntactic level can almost ignore input elements that aren't tokens, except that Automatic Semicolon Insertion rules require it to pay attention to line-terminators in whitespace and comments.
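
One way to picture this is a filter between the two levels that discards non-token input elements but remembers whether a line terminator was seen. The following Python sketch is only an assumption about how an implementation might do it; the function name `attach_newline_flags`, the `(kind, text)` pairs and the kind names are hypothetical and not taken from the specification's algorithms.

```python
# The four line terminator code points: LF, CR, LS, PS.
LINE_TERMINATORS = "\n\r\u2028\u2029"


def attach_newline_flags(elements):
    """Turn input elements into tokens, each flagged with whether a line
    terminator occurred between it and the previous token."""
    tokens = []
    newline_seen = False
    for kind, text in elements:
        if kind in ("WhiteSpace", "LineTerminator", "Comment"):
            # A comment that spans lines counts like a line terminator.
            if any(ch in LINE_TERMINATORS for ch in text):
                newline_seen = True
            continue
        tokens.append((kind, text, newline_seen))
        newline_seen = False
    return tokens


# Hand-written input elements for the source text "return\nx"
elements = [("Token", "return"), ("LineTerminator", "\n"), ("Token", "x")]
print(attach_newline_flags(elements))
# [('Token', 'return', False), ('Token', 'x', True)]
```

With such a flag the syntactic level can see that `x` starts on a new line after `return`, which is the kind of information the Automatic Semicolon Insertion rules depend on.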

Your "How are these reconcilable?" question seems to stem from a misunderstanding of either "terminal symbol" or "produces", but it's not clear to me which.