antlr4: need to convert sequences of symbols to characters in lexer

Question

I am writing a parser for Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[, and ]. For example: \[Pi].

Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that would convert all named characters to their unicode representation, and two is to enumerate all possible named characters in their source form as part of the regular expression.

The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.

So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check if the named characters syntax is correct and then look up the name and convert it to unicode.

But if the syntax is incorrect or the name does not exist I need to tell the user about it. How do I propagate this error to the user and yet let antlr4 recover from the error and resume? Maybe I can sort of "pipe" lexers/parsers? (I am new to antlr).

EDIT:

In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between brackets is called "named character". There is a limited set of named characters, each of which corresponds to a unicode code point. I am trying to figure out how to tokenize identifiers like this.

I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):

NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;

but I would like to check if the named character actually exists (and other attributes such as if it is a letter, but the latter part is outside of the scope of the question), so this regex won't work.

I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.

What would be a good approach to this?

END OF EDIT

GRosenberg GRosenberg · Accepted Answer · 2015-07-19T03:37:44

A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:

NChar : NCBeg .? RBrack ;

fragment NCBeg : '\\[' ;
fragment LBrack: '['   ;
fragment RBrack: ']'   ;

Update

In the parser, allow the NChar's to exist in the parse-tree as discrete terminal nodes:

idents : ident+ ;
ident  : NChar   // named character string
       | ID      // simple character string?
       | Literal // something quoted?
       | ....
       ;

This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt; and isolates analysis of all ordering issues to the idents context.

Update2

For an input \[Pi]Squared, the parse tree form that would be easiest to analyze would be an idents node with two well-ordered children, \[Pi] and Squared.

Best practice would not be to pack both children into the same token - would just have to later manually break the token text into the two parts to check if it is contains a valid named character and whether the particular sequence of parts is allowable.

No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:

NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;

If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.

----

A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.

To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:

@members {
    ArrayList<String> keyList = .... // list of named chars

    public boolean inList(String id) {
        return keyList.contains(id) ;
    }
}

nChar    : known
         | unknown
         ;

known    : NChar { inList($NChar.getText()) }?             ;
unknown  : NChar { error("Unknown " + $NChar.getText()); } ;

The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse-tree is a bit complex. Easier to do when implemented as a parse-tree decoration during a visitor operation.

Finally, a scrape and munge of the named characters into a usable map (both unicode and ascii) is likely worthwhile to handle both representations as well as conversions and misspelling.

antlr4: need to convert sequences of symbols to characters in lexer

1 Answers