ANTLR4-based lexer loses syntax highlighting during typing on NetBeans

Question

I've written a simple lexer and parser using ANTLR 4 grammars to build a language plugin for NetBeans 7.3, to help my team write our layout files more quickly. These files are a mix of XHTML and widget definitions; the widgets also take the form of XHTML tags, but with custom properties and characteristics and with some differences from XHTML syntax.

Template file example:

<div style="dyn_layout_panel">
    @symbol@
    <w_label=label, text="Try to close this window" />
    <w_buttonclose=button, text = "CLOSE", on_press=press_close />
    <w_buttonterminate=button, text="TERMINATE", on_press=press_terminate />
    <w_mydatepicker=datepicker, parent=tab0, ary=[10, "str", /regex/i], start_date=2013-10-05, on_selected=datepicker_selected />
    <w_myeditbox=editbox, parent=tab0, validation=USER_REGEX, validation_regex=/^[0-9]+[a-z]*$/i,
        validation_msg="User regex don't match editbox contents.", on_keyreturn=tab0_editbox_keyreturn />
    <div style="dyn_layout_panel">
        $SYMBOL_2$
        Some text that make a text node.
    </div>
</div>

I use ANTLRWorks 2 to write and debug the lexer and parser, and everything seems to be fine. In NetBeans I don't get any exceptions either and the parser works properly, but while editing/typing I lose token colors near the cursor.

Screenshot of problem: (image not included)

After adding debug console output for each keystroke, I see that the lexer enters IN_TAG or IN_WIDGET mode correctly, but after a WHITESPACE it returns to the default mode and matches the rest of the text inside a tag as a TEXT_NODE token.

I know that a lexer can have only one active mode at a time, so why does it match the TEXT_NODE rule while in IN_TAG or IN_WIDGET mode?

Lexer grammar file:

lexer grammar LayoutLexer;

COMMENT
    :   '/*' .*? '*/' -> channel(HIDDEN)
    ;

WS  :   ( ' '
        | '\t'
        | EOL
        )+? -> channel(HIDDEN)
        ;

WDG_START_OPEN : '<w_' PROPERTY -> pushMode(IN_WIDGET) ;
WDG_END_OPEN : '</w_' PROPERTY -> pushMode(IN_WIDGET) ;
TAG_START_OPEN : '<' ATTRIBUTE -> pushMode(IN_TAG) ;
TAG_END_OPEN : '</' ATTRIBUTE -> pushMode(IN_TAG) ;

EXT_REF
    :   ( ('@' REF_NAME '@') | ('$' SYMBOL '$') | ('§' REF_NAME '§') )
    ;

fragment
REF_NAME
    :   ( [a-z]+ [0-9a-z_]*? )
    ;

fragment
EOL :   ( '\r\n' | '\n\r' | '\n' )
    ;

EQUAL
    :   '='
    ;

TEXT_NODE
    :   ( (~('\r'|'\n'|'<'|'@'|'$'|'§'))+ )
    ;

ERROR
    :   ( .+? )
    ;

mode IN_TAG;

TAG_CLOSE : '>' -> popMode ;
TAG_EMPTY_CLOSE : '/>' -> popMode ;

TAG_WS : WS -> type(WS), channel(HIDDEN) ;
TAG_COMMENT : COMMENT -> type(COMMENT), channel(HIDDEN) ;

TAG_EQ : EQUAL -> type(EQUAL) ;

ATTRIBUTE
    :   ( LITERAL [0-9a-zA-Z_]* )
    ;

VAL
    :  ( '"' ( ESC_SEQ | ~('\\'|'"') )*? '"'
    |  '\'' ( ESC_SEQ | ~('\\'|'\'') )*? '\'' )
    ;

TAG_ERR : ERROR -> type(ERROR) ;

mode IN_WIDGET;

WDG_CLOSE : '>' -> popMode ;
WDG_EMPTY_CLOSE : '/>' -> popMode ;

WDG_WS : WS -> type(WS), mode(IN_WIDGET), channel(HIDDEN) ;
WDG_COMMENT : COMMENT -> type(COMMENT), channel(HIDDEN) ;

WDG_EQ : EQUAL -> type(EQUAL), pushMode(WDG_ASSIGN) ;

COMMA
    :   ','
    ;

fragment
MINUS
    :   '-'
    ;

STRING
    :  ( '"' ( ESC_SEQ | ~('\\'|'"') )*? '"'
    |  '\'' ( ESC_SEQ | ~('\\'|'\'') )*? '\'' )
    ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

fragment
HEX_DIGIT
    :   [0-9a-fA-F]
    ;

fragment
DIGIT
    :   [0-9]
    ;

fragment
HEX_NUMBER
    :   '0x' HEX_DIGIT+
    ;

fragment
HTML_NUMBER
    :   (INT_NUMBER | FLOAT_NUMBER) HTML_UNITS
    ;

fragment
FLOAT_NUMBER
    :   MINUS? INT_NUMBER '.' DIGIT+
    ;

fragment
INT_NUMBER
    :   MINUS? DIGIT+
    ;

EVENT_HANDLER
    :   'on_' PROPERTY
    ;

PROPERTY
    :   ( LITERAL [0-9a-zA-Z_]* )
    ;

fragment
LITERAL 
    :   ( LITERAL_U | LITERAL_L )
    ;

fragment
LITERAL_U
    :   [A-Z]+
    ;

fragment
LITERAL_L
    :   [a-z]+
    ;

WDG_ERR : ERROR -> type(ERROR) ;

mode WDG_ASSIGN;

PHP_REF
    : ( LITERAL_L ('_' | LITERAL_L | [0-9])* ) -> popMode
    ;

VALUE : (WDG_VAL | ARRAY) -> popMode;

ASGN_WS : WS -> type(WS), channel(HIDDEN);
ASGN_COMMA : COMMA -> type(COMMA);

ARY_START
    :   '[' 
    ;

ARY_END
    :   ']'
    ;

BIT_OR
    :   '|'
    ;

ARRAY
    :   ARY_START ARY_VALUE (ASGN_COMMA ARY_VALUE)* ARY_END
    ;

fragment
ARY_VALUE : ASGN_WS? WDG_VAL ASGN_WS? -> type(VALUE);

fragment
WDG_VAL
    :   (STRING
    |   UTC_DATE
    |   HEX_NUMBER
    |   HTML_NUMBER
    |   FLOAT_NUMBER
    |   INT_NUMBER
    |   BOOLEAN
    |   BITFIELD
    |   REGEX
    |   CSS_CLASS)
    ;

fragment
HTML_UNITS
    :   ('%'|'in'|'cm'|'mm'|'em'|'ex'|'pt'|'pc'|'px')
    ;

fragment
BOOLEAN
    :   ('true'|'false')
    ;

fragment
BITFIELD
    :   SYMBOL (WS? BIT_OR WS? SYMBOL)*
    ;

SYMBOL
    :   LITERAL_U [0-9A-Z_]*
    ;

UTC_DATE
    :   (DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT)
    ;

REGEX
    : ('/' ('\\'.|.)*? '/' ('g'|'m'|'i')* )
    ;

CSS_CLASS
    : ( LITERAL_L ('-' | '_' | LITERAL_L | [0-9])* )
    ;

WDG_ASSIGN_ERR : ERROR -> type(ERROR), popMode;

Parser grammar file:

parser grammar LayoutParser;

options 
{
    tokenVocab=LayoutLexer;
    language=Java;
}

document : (element | TEXT_NODE | EXT_REF)* EOF;

element
locals
[
    String currentTag
]
    : ( ( html_open_tag (element | TEXT_NODE | EXT_REF)* html_close_tag )
    | ( wdg_open_tag (element | TEXT_NODE | EXT_REF)* wdg_close_tag )
    | ( html_empty_tag | wdg_empty_tag ) )
    ;

html_empty_tag
    : TAG_START_OPEN (ATTRIBUTE EQUAL VAL)* TAG_EMPTY_CLOSE
    ;

html_open_tag
    : ( tag=TAG_START_OPEN (ATTRIBUTE EQUAL VAL)* TAG_CLOSE )
        {$element::currentTag = $tag.text.substring(1);}
    ;

html_close_tag
    : tag=TAG_END_OPEN TAG_CLOSE
        {
            if (!$element::currentTag.equals($tag.text.substring(2)))
                notifyErrorListeners("HTML tag mismatch '" + $element::currentTag + "' - '" + $tag.text.substring(2) + "'");
        }
    ;

wdg_empty_tag
    : WDG_START_OPEN EQUAL PHP_REF ( COMMA (wdg_prop | wdg_event) )* WDG_EMPTY_CLOSE
    ;

wdg_open_tag
    : tag=WDG_START_OPEN EQUAL PHP_REF ( COMMA (wdg_prop | wdg_event) )* WDG_CLOSE
        {$element::currentTag = $tag.text.substring(1);}
    ;

wdg_close_tag
    : tag=WDG_END_OPEN WDG_CLOSE
        {
            if (!$element::currentTag.equals($tag.text.substring(2)))
                notifyErrorListeners("Widget alias mismatch '" + $element::currentTag + "' - '" + $tag.text + "'");
        }
    ;

wdg_prop
    : PROPERTY (EQUAL (ARRAY | VALUE | PHP_REF | UTC_DATE | REGEX | CSS_CLASS))?
    ;

wdg_event
    : EVENT_HANDLER EQUAL PHP_REF
    ;

Sam Harwell · Accepted Answer · 2014-05-27T11:35:45

Depending on the implementation of syntax highlighting, the IDE may or may not start at the beginning of the document when lexing the input for syntax highlighting. If it does not start at the beginning of the document, then before returning any tokens, you need to ensure that the lexer instance is initialized in the correct mode (both the _mode and _modeStack fields need to be initialized to their correct state at the point where lexing starts).

If your lexer reads or writes any custom fields during lexing, you may need to restore those fields as well.
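
The save-and-restore idea above can be sketched roughly as follows. This is a minimal, self-contained model and not the ANTLR runtime or the NetBeans lexer API: ModalLexerModel, State, snapshot and restore are hypothetical names, and the mode and modeStack fields only stand in for the _mode and _modeStack fields of a real lexer instance.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy stand-in for a modal lexer (hypothetical; NOT the ANTLR runtime class). */
final class ModalLexerModel {
    static final int DEFAULT_MODE = 0;
    static final int IN_TAG = 1;
    static final int IN_WIDGET = 2;

    int mode = DEFAULT_MODE;                              // plays the role of _mode
    final Deque<Integer> modeStack = new ArrayDeque<>();  // plays the role of _modeStack

    void pushMode(int m) { modeStack.push(mode); mode = m; }
    void popMode() { mode = modeStack.pop(); }

    /** Snapshot of everything needed to resume lexing at some offset. */
    static final class State {
        final int mode;
        final List<Integer> stack; // top of the stack first
        State(int mode, List<Integer> stack) { this.mode = mode; this.stack = stack; }
    }

    /** Capture the state, e.g. at a token boundary the editor may restart from. */
    State snapshot() {
        return new State(mode, new ArrayList<>(modeStack));
    }

    /** Put a (possibly fresh) lexer back into a previously captured state. */
    void restore(State s) {
        mode = s.mode;
        modeStack.clear();
        // Re-push from bottom to top so the original stack order is preserved.
        for (int i = s.stack.size() - 1; i >= 0; i--) {
            modeStack.push(s.stack.get(i));
        }
    }
}
```

An editor integration would typically store one such snapshot per token or per line and call restore before relexing from the middle of the document. Without that step, a lexer restarted inside a tag begins in the default mode, which is consistent with the TEXT_NODE tokens described in the question.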

Examples

GoWorks (NetBeans based, LGPL License). This implementation does not use the lexer facilities in the NetBeans API, but instead implements the functionality at a lower level. For now you can ignore the MarkOccurrences* and SemanticHighlighter classes.
- package org.tvl.goworks.editor.go.highlighter
- package org.antlr.works.editor.antlr4.highlighting
ANTLR 4 IntelliJ Plugin (IntelliJ IDEA, BSD license).
- package org.antlr.intellij.adaptor.lexer
- package org.antlr.intellij.plugin (in particular, the SyntaxHighlighter classes)

Additional efficiency notes

Your REF_NAME, VAL, and STRING rules use non-greedy loops that do not need to be non-greedy. In each of these rules, change +? to + and change *? to *.
Your WS and ERROR rules use a non-greedy operator +? which is equivalent to not having a closure at all. The unnecessary use of a non-greedy operator in these cases only serves to slow down your lexer. To preserve the existing behavior, you can remove +? from these rules (replacing with + would change behavior).
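
As a rough analogy for the WS and ERROR point, the same effect can be observed with reluctant quantifiers in java.util.regex. This is not ANTLR's lexer engine, so treat it only as an illustration of why a trailing +? behaves like no closure at all; NonGreedyDemo and prefixMatchLength are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonGreedyDemo {
    /** Length of the match anchored at the start of the input, or -1 if there is none. */
    static int prefixMatchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }
}
```

For the input of three spaces followed by x, prefixMatchLength("[ \\t]+?", "   x") and prefixMatchLength("[ \\t]", "   x") both return 1, while prefixMatchLength("[ \\t]+", "   x") returns 3. The reluctant loop stops after a single iteration because nothing after it forces it to continue.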

Additional functionality notes

ANTLR 4 does not perform any error correction during lexing. If the input does not match a token, then the input simply does not match a token. This issue affects your VAL and STRING tokens in particular, which will not get syntax highlighting prior to adding the closing " or ' character. For syntax highlighting these types of tokens, I prefer to use an additional mode in the lexer, allowing me to produce separate tokens for the escape sequences embedded in the string, as well as syntax highlighting an unterminated string at the end of the line (unless your language allows strings to span multiple lines, in which case you'd stop at the end of the input).
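
To illustrate the separate string mode idea, here is a small hand-written sketch in plain Java rather than an ANTLR grammar. It assumes single-line strings and two-character escapes only (no Unicode or octal escapes), and every name in it (StringHighlightSketch, Kind, Tok, lexString) is hypothetical. It emits separate pieces for the opening quote, plain text, escape sequences and the closing quote, and it still returns pieces for a string that has no closing quote yet.

```java
import java.util.ArrayList;
import java.util.List;

final class StringHighlightSketch {
    enum Kind { STRING_START, STRING_TEXT, STRING_ESCAPE, STRING_END }

    static final class Tok {
        final Kind kind;
        final String text;
        Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    /** Emit pending plain text, if any, as a STRING_TEXT piece. */
    private static void flush(StringBuilder pending, List<Tok> out) {
        if (pending.length() > 0) {
            out.add(new Tok(Kind.STRING_TEXT, pending.toString()));
            pending.setLength(0);
        }
    }

    /** Splits input that starts at a quote character into highlightable pieces. */
    static List<Tok> lexString(String input) {
        List<Tok> out = new ArrayList<>();
        if (input.isEmpty()) return out;
        char quote = input.charAt(0);
        if (quote != '"' && quote != '\'') return out; // not a string start
        out.add(new Tok(Kind.STRING_START, String.valueOf(quote)));

        StringBuilder pending = new StringBuilder();
        int i = 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\n' || c == '\r') break;          // unterminated: stop at end of line
            if (c == quote) {                            // closing quote ends the string
                flush(pending, out);
                out.add(new Tok(Kind.STRING_END, String.valueOf(quote)));
                return out;
            }
            boolean escape = c == '\\' && i + 1 < input.length()
                    && input.charAt(i + 1) != '\n' && input.charAt(i + 1) != '\r';
            if (escape) {                                // backslash plus one character
                flush(pending, out);
                out.add(new Tok(Kind.STRING_ESCAPE, input.substring(i, i + 2)));
                i += 2;
                continue;
            }
            pending.append(c);
            i++;
        }
        flush(pending, out);                             // keep the unterminated tail
        return out;
    }
}
```

With this shape, an input such as "abc with no closing quote still yields STRING_START and STRING_TEXT pieces, so the editor has something to color while the user is typing.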
