I'm trying to implement syntax highlighting for Java code using this ANTLR grammar. My strategy has been to parse the code into a tree with that grammar, and then use a visitor to go through each terminal in the tree and assign its corresponding text a color. This color is usually just the color associated with the terminal's token, but it can be overridden depending on context. For example, consider this screenshot from VSCode:

By default, identifiers are colored white. However, if they are known to refer to classes/methods, then they are colored green. I would like to make a similar distinction in my visitor by labelling identifiers white by default, but overriding that with green for classes/methods.

So far, I have been successful in implementing this for class/method declarations. The production rule for classDeclaration looks like this:

classDeclaration
    :   'class' Identifier typeParameters?
        ('extends' typeType)?
        ('implements' typeList)?
        classBody
    ;

Here, Identifier is a terminal, while all of the other nonliterals are nonterminals. My strategy was to color every child terminal that has an overridable token green (1). "Overridable" is a term I invented in my codebase to deal with this problem. Essentially, keywords should always have the same color no matter the context, so their tokens are not overridable. An identifier's color depends on context, so identifiers have a default (white) that can be overridden with green. The only terminals in the above production are 'class', Identifier, 'extends', and 'implements'. The first and the last two are keywords and not overridable, so following procedure (1) colors only the class name green.

Here is the C# code I used to implement the above strategy.

Unfortunately, this strategy appears to be problematic when attempting to highlight method invocations, such as blah.blah() above. Here is the production rule for an expression:

expression
    :   primary
    |   expression '.' Identifier
    |   expression '.' 'this'
    |   expression '.' 'new' nonWildcardTypeArguments? innerCreator
    |   expression '.' 'super' superSuffix
    |   expression '.' explicitGenericInvocation
    |   expression '[' expression ']'
    |   expression '(' expressionList? ')'
    |   // Lots of other stuff
    ;

This means that foo.bar() parses as (('foo') '.' 'bar') '(' ')'. If, for all expressions, I color all Identifier children green, then foo.bar() will have foo white and bar green as intended. (Note that foo is a primary, and its terminal is not the direct child of an expression.) However, foo.bar also has foo white and bar green, which does not match the behavior of VSCode above.
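To make the problem concrete, here is a minimal Java sketch of procedure (1) applied to every expression node. It does not use the ANTLR runtime: the `Node` class, the `rule`/`id`/`lit` helpers, and the color names are all hypothetical stand-ins for the real parse tree and palette. It only shows that `foo.bar()` and `foo.bar` end up colored identically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical hand-rolled parse tree; a real ANTLR tree would use ParseTree/TerminalNode.
final class ExpressionColoringDemo {
    static final class Node {
        final String rule;      // rule name, or null for a terminal
        final String tokenType; // token type, or null for a nonterminal
        final String text;      // token text, or null for a nonterminal
        final List<Node> children;
        String color;

        Node(String rule, String tokenType, String text, Node... children) {
            this.rule = rule;
            this.tokenType = tokenType;
            this.text = text;
            this.children = Arrays.asList(children);
        }

        boolean isTerminal() { return rule == null; }
    }

    static Node rule(String name, Node... children) { return new Node(name, null, null, children); }
    static Node id(String text) { return new Node(null, "Identifier", text); }
    static Node lit(String text) { return new Node(null, "Literal", text); }

    // Only identifiers may be recolored; keywords and punctuation keep their color.
    static boolean overridable(Node n) {
        return n.isTerminal() && n.tokenType.equals("Identifier");
    }

    // Assign default colors bottom-up, then apply procedure (1) to each expression node.
    static void color(Node n) {
        if (n.isTerminal()) {
            n.color = overridable(n) ? "white" : "fixed";
            return;
        }
        for (Node child : n.children) color(child);
        if (n.rule.equals("expression")) {
            for (Node child : n.children) {
                if (overridable(child)) child.color = "green";
            }
        }
    }

    static void collect(Node n, List<String> out) {
        if (overridable(n)) out.add(n.text + ":" + n.color);
        for (Node child : n.children) collect(child, out);
    }

    static List<String> identifierColors(Node root) {
        color(root);
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    // foo.bar parses as expression(expression(primary(foo)) '.' bar)
    static Node memberAccess() {
        return rule("expression",
                rule("expression", rule("primary", id("foo"))), lit("."), id("bar"));
    }

    // foo.bar() wraps the member access in another expression with '(' ')'
    static Node methodCall() {
        return rule("expression", memberAccess(), lit("("), lit(")"));
    }

    public static void main(String[] args) {
        System.out.println(identifierColors(methodCall()));   // [foo:white, bar:green]
        System.out.println(identifierColors(memberAccess())); // [foo:white, bar:green]
    }
}
```

Both trees give the same result because `bar` is a direct child of an expression node in either case; the parentheses belong to the enclosing expression, which procedure (1) never consults.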

I attempted to work around this by creating a new production for expressions that look like expression '.' Identifier '(' expressionList? ')' and referencing that from expression.

expression
    :   // ...
    |   expression '[' expression ']'
    |   invocationExpression
    |   // ...
    ;

invocationExpression
    :   expression '.' Identifier '(' expressionList? ')'
    |   expression '(' expressionList? ')'
    ;

Then, I would be able to run procedure (1) against invocationExpressions in my visitor, coloring all child Identifiers green, which would make foo.bar() white-green and foo.bar white-white as intended. However, ANTLR is complaining because expression and invocationExpression are mutually left-recursive. How do I overcome this, or is there a different approach to solve this problem?

2 Answers

1
votes

As far as I can see, you are only creating the extra rule so that it produces a distinct node in the parse tree, which lets your code know that a method call is in progress.

To do that, you don't have to create a new rule. You can use labels instead. Basically, that means giving each alternative in a rule a different label, so that each alternative gets its own context class in the parse tree. ANTLR also generates separate enter and exit methods (and visit methods, if you use a visitor) for every labeled alternative.
Here you can find the description of these labels on the ANTLR GitHub page.
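As a rough illustration of what labels give you on the visitor side, the Java sketch below models labeled alternatives by hand. It does not use the ANTLR runtime: the grammar in the leading comment is an assumed way of labeling the rule, the label names (`PrimaryExpr`, `MemberAccessExpr`, `MethodCallExpr`) are hypothetical, and the nested classes stand in for the context classes ANTLR 4 would generate. Note that ANTLR 4 requires every alternative of a rule to be labeled once any one of them is.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed labeling of the rule (hypothetical label names, other alternatives omitted):
//
// expression
//     :   primary                                            # PrimaryExpr
//     |   expression '.' Identifier '(' expressionList? ')'  # MethodCallExpr
//     |   expression '.' Identifier                          # MemberAccessExpr
//     ;
//
// The classes below are hand-written stand-ins for the generated context classes.
final class LabeledAlternativesDemo {
    interface Visitor {
        void visitPrimaryExpr(PrimaryExpr e);
        void visitMemberAccessExpr(MemberAccessExpr e);
        void visitMethodCallExpr(MethodCallExpr e);
    }

    static abstract class Expr {
        abstract void accept(Visitor v);
    }

    static final class PrimaryExpr extends Expr {
        final String id;
        PrimaryExpr(String id) { this.id = id; }
        void accept(Visitor v) { v.visitPrimaryExpr(this); }
    }

    static final class MemberAccessExpr extends Expr {
        final Expr target;
        final String id;
        MemberAccessExpr(Expr target, String id) { this.target = target; this.id = id; }
        void accept(Visitor v) { v.visitMemberAccessExpr(this); }
    }

    static final class MethodCallExpr extends Expr {
        final Expr target;
        final String id;
        MethodCallExpr(Expr target, String id) { this.target = target; this.id = id; }
        void accept(Visitor v) { v.visitMethodCallExpr(this); }
    }

    // Each alternative has its own visit method, so only method names turn green.
    static final class Highlighter implements Visitor {
        final List<String> colors = new ArrayList<>();

        public void visitPrimaryExpr(PrimaryExpr e) {
            colors.add(e.id + ":white");
        }

        public void visitMemberAccessExpr(MemberAccessExpr e) {
            e.target.accept(this);
            colors.add(e.id + ":white");
        }

        public void visitMethodCallExpr(MethodCallExpr e) {
            e.target.accept(this);
            colors.add(e.id + ":green");
        }
    }

    static List<String> highlight(Expr e) {
        Highlighter h = new Highlighter();
        e.accept(h);
        return h.colors;
    }

    public static void main(String[] args) {
        System.out.println(highlight(new MethodCallExpr(new PrimaryExpr("foo"), "bar")));   // [foo:white, bar:green]
        System.out.println(highlight(new MemberAccessExpr(new PrimaryExpr("foo"), "bar"))); // [foo:white, bar:white]
    }
}
```

Because the method-call alternative lives directly inside `expression` in this assumed grammar, the recursion stays direct, which ANTLR 4 supports, rather than mutual between two rules.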

1
votes

You should separate the two aspects into individual steps and not try to solve everything in one go. What you need first is a symbol table: something that holds information about your syntax entities (class names, variable names, constants, etc.). You can build it when parsing your input and update it whenever something changes. This step is completely isolated from the highlighting itself.

When your editor wants to tokenize the input (using a lexer, nothing more!), you can then look up in your symbol table whether an identifier you found is a known entity name and change its color accordingly.
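A minimal Java sketch of that second step might look like the following. Everything in it is an assumption made for illustration: the symbol table is reduced to a plain set of known class/method names handed in by the caller, the regex tokenizer is a toy stand-in for a real lexer, and the keyword list and color names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SymbolTableHighlighter {
    // Tiny, incomplete keyword list; a real lexer would know all Java keywords.
    static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("class", "new", "return", "this", "super"));

    // Toy tokenizer: an identifier-like word, or any single non-space character.
    static final Pattern TOKEN = Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*|\\S");

    // knownNames plays the role of the symbol table built in the separate parsing step.
    static List<String> highlight(String source, Set<String> knownNames) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String text = m.group();
            String color;
            if (KEYWORDS.contains(text)) {
                color = "blue";                       // keywords never change color
            } else if (Character.isJavaIdentifierStart(text.charAt(0))) {
                color = knownNames.contains(text) ? "green" : "white";
            } else {
                color = "plain";                      // punctuation, digits, etc.
            }
            out.add(text + ":" + color);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("bar", "Foo"));
        System.out.println(highlight("foo.bar(baz)", known));
        System.out.println(highlight("new Foo()", known));
    }
}
```

A flat set of names ignores scoping and cannot tell a method `bar` from a field `bar`; a real symbol table would record the kind and scope of each entity.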