2
votes

I have been making some progress in the development of our little DSL but have run into a problem when trying to highlight the comments in the TextEditorControl we are using. The ICSharpCode control is great by the way and in combination with ANTLR it makes a great platform for DSLs.

I have a working grammar and lexer and have written a highlighting strategy in the text editor which also works well. The only element of the DSL which is refusing to color correctly is the "Comment", which I have on the hidden channel.

    Comment
  :  '//' ~('\r' | '\n')* {$channel=Hidden;}  
  |  '/*' .* '*/'  {$channel=Hidden;}        
  ;

The frustrating thing is that I can get the highlighting to work if I take the Comment lexer rule off the hidden channel, but when I do that the parser stops parsing during evaluation after the last piece of text following the comment.

As an example, this parses correctly when comments are hidden but stops parsing at the first "abc" when they are not:

   //Comment

   abc=[7,8,9];

   return abc[2];

I have been trying to access the hidden channel separately so that I could perhaps combine the default and hidden token lists into one list ordered by start index and then highlight from there but I'm having no luck using the BaseRecognizer.Hidden parameter for the CommonTokenStream constructor.
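
For illustration, the "combine both lists and order by start index" idea could look roughly like the sketch below. It is only a sketch under assumptions: `Tok` is a hypothetical stand-in for ANTLR's `IToken` (only the fields used here), `TokenMerge.MergeByStart` is a made-up helper name, and it says nothing about how the hidden tokens are obtained in the first place.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for ANTLR's IToken; only the members used here.
public sealed class Tok
{
    public Tok(int type, int channel, int startIndex, int stopIndex)
    {
        Type = type;
        Channel = channel;
        StartIndex = startIndex;
        StopIndex = stopIndex;
    }

    public int Type { get; }
    public int Channel { get; }
    public int StartIndex { get; }
    public int StopIndex { get; }
}

public static class TokenMerge
{
    // Combine the tokens of two channels into one list ordered by position in the text.
    public static List<Tok> MergeByStart(IEnumerable<Tok> defaultTokens, IEnumerable<Tok> hiddenTokens)
    {
        return defaultTokens.Concat(hiddenTokens)
                            .OrderBy(t => t.StartIndex)
                            .ToList();
    }
}
```

The merged list can then be walked once, left to right, in the same way the highlighting loop walks a single token list.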

My current attempt at highlighting the TextEditor line looks like this

    private void MarkUsingParalexTokens(IDocument document, LineSegment line)
    {
        var text = document.GetText(line).ToLower();
        var input = new ANTLRStringStream(text);
        _lexer.CharStream = input;
        _tokens = new CommonTokenStream(_lexer, BaseRecognizer.Hidden);
        //_tokens.TokenSource =_lexer;


        var wordStart = 0;
        if (_tokens.Count > 1)
        {
            do
            {
                _tokens.Consume();

            } while (_tokens.LastToken.Type != ParalexLexer.EOF);



            var tokenList = _tokens.GetTokens();

            var tokenEnum = tokenList.GetEnumerator();

            var tokenAvailable = tokenEnum.MoveNext();
            if (tokenAvailable)
            {
                for (var i = 0; i < text.Length; i++)
                {
                    var token = tokenEnum.Current;
                    if (token != null)
                    {
                        var c = text[i];
                        if (c == ' ' || c == '\t')
                        {
                            if (i > wordStart)
                                AddWord(document, line, wordStart, i - wordStart); // AddWord takes a length, not an end index
                            line.Words.Add(c == ' ' ? TextWord.Space : TextWord.Tab);
                            wordStart = i + 1;
                        }
                        else
                        {
                            var atStartOfToken = (i == token.StartIndex);

                            if (atStartOfToken)
                            {
                                if (i > wordStart)
                                    AddWord(document, line, wordStart, i - wordStart); // AddWord takes a length, not an end index

                                var tokenLength = token.StopIndex - token.StartIndex + 1;

                                AddWord(document, line, i, tokenLength, token);
                                tokenEnum.MoveNext();
                                wordStart = i + tokenLength;
                                i = wordStart - 1;
                            }
                        }
                    }

                }

            }
        }

        if (wordStart < line.Length)
                AddWord(document, line, wordStart, line.Length - wordStart); // remaining length, not the full line length
    }

    void AddWord(IDocument document, LineSegment line, int startOffset, int length, IToken token = null)
    {
        if (length==0) return;

        var hasSpecialColor = token != null;
        var color = hasSpecialColor ? GetColor(token) : _highlightColors["Default"];

        line.Words.Add(new TextWord(document, line, startOffset, length, color, !hasSpecialColor));
        if (token != null) Debug.WriteLine("From typing: Text {0}, Type {1}, Color {2}", token.Text, token.Type, color);
    }

    private HighlightColor GetColor(IToken token)
    {
        var name = token.Type;
        var groupName = "Default";

        var punctuation = new[]
            {6, 7, 9, 14, 15, 16, 17, 18, 22, 28, 33, 34, 47, 48, 49, 50, 51, 52, 55, 56, 57, 58, 60, 62, 65, 71};
        var paralexVerbs = new[] { 8, 13, 23, 26, 27, 31, 32, 38, 39, 40, 54, 64, 68, 73, 75, 76 };
        var paralexNouns = new[] {11, 12, 42, 43, 59, 66};
        var paralexNumbers = new[] { 53, 61, 41 };
        var paralexStrings = new[] {70};

        if (Array.IndexOf(punctuation, name) >= 0)
        {
            groupName = "Punctuation";
        }
        else if (Array.IndexOf(paralexVerbs, name) >= 0)
        {
            groupName = "ParalexVerbs";
        }
        else if (Array.IndexOf(paralexNouns, name) >= 0)
        {
            groupName = "ParalexNouns";
        }
        else if (Array.IndexOf(paralexNumbers, name) >= 0)
        {
            groupName = "ParalexNumbers";
        }
        else if (Array.IndexOf(paralexStrings, name) >= 0)
        {
            groupName = "ParalexStrings";
        }
        else if (name == 19)
        {
            groupName = "ParalexComment";
        }

        return _highlightColors[groupName];

    }

The do..while seems to be needed to get the tokens into the list otherwise GetTokens never delivers anything. In the form the code is in above no tokens are produced even when entering comments into my test rig.

If I take out the call to the parametrized constructor for the CommonTokenStream and go with a base constructor I get a nice stream of tokens which I can color but all hidden tokens are...well...hidden I guess.

Your collective thoughts on this little problem would be appreciated, as well as any ideas you might have on how I can programmatically maintain the lists of types rather than having to rejig them every time I change the parser.
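
One possible way to avoid hard-coded type numbers, sketched here under assumptions, is to read the token names from the `const int` fields of the generated lexer class via reflection and then group by name. `DemoLexer` below is a hypothetical stand-in with made-up values, and `TokenTypeNames.FromConstants` is an illustrative helper name, not part of ANTLR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for a generated lexer; names and numbers are illustrative only.
public class DemoLexer
{
    public const int Comment = 19;
    public const int Strng = 70;
    public const int Return = 64;
}

public static class TokenTypeNames
{
    // Build a value -> name map from the public const int fields declared on a lexer type.
    public static Dictionary<int, string> FromConstants(Type lexerType)
    {
        return lexerType
            .GetFields(BindingFlags.Public | BindingFlags.Static | BindingFlags.DeclaredOnly)
            .Where(f => f.IsLiteral && f.FieldType == typeof(int))
            .GroupBy(f => (int)f.GetRawConstantValue())
            // If two constants share a value, keep the first name found.
            .ToDictionary(g => g.Key, g => g.First().Name);
    }
}
```

With such a map, the color groups can be keyed by name (for example "Comment" or "Return") and survive renumbering when the grammar changes. Note that any other `int` constants declared in the same class would be picked up too, so they may need filtering out.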

I had thought to create independent channels for each type requiring coloring but at the moment I'm only adding recursively to my problem!

Thanks in advance Ian

EDIT:

Thanks for your great answer Sam it's much appreciated. It's marked and scored.

I've gone with the override concept as it also solves the problem of keeping track of the various Token types by name and thus simplifies my maintenance as I add to the grammar.

I have created a syntax highlight lexer and a separate evaluate lexer and used independent channels which I have created in the original grammar.

Comment now looks like this. The first alternative works nicely, although I think the second (block comment) alternative is not working yet:

Comment
:  '//' ~('\r' | '\n')* 
|  '/*' .* '*/'        
;

The lexer members section has these added:

    @lexer::members{

    public const int StringChannel = 98;
    public const int NumberChannel = 97;
    public const int NounChannel = 96;
    public const int VerbChannel = 95;
    public const int CommentChannel = 94;

    }

The highlight lexer uses this override of Emit(). Your suggested override is also in place and working.

public class HighlightLexer : ParalexLexer
{
    public override IToken Emit()
    {
        switch (state.type)
        {
            case Strng:
                state.channel = StringChannel;
                break;
            case Nmber:
            case Null:
            case Bool:
            case Instrument:
            case Price:
            case PeriodType:
                state.channel = NumberChannel;
                break;
            case BarPeriod:
            case BarValue:
            case InstrumentList:
            case SMA:
            case Identifier:
                state.channel = NounChannel;
                break;
            case Assert:
            case Do:
            case Else:
            case End:
            case Fetch:
            case For:
            case If:
            case In:
            case Return:
            case Size:
            case To:
            case While:
            case T__77:
                state.channel = VerbChannel;
                break;
            case Comment:
                state.channel = CommentChannel;
                break;
            default:
                state.channel = DefaultTokenChannel;
                break;
        }

        return base.Emit();
    }
}

One thing which was bugging me was an apparent inability to get the list of tokens easily. I couldn't get CommonTokenStream to deliver up its tokens without delays and trip-ups. I took a punt on using BufferedTokenStream for "_tokens", as that sounded more like what I was after, and hey presto: tokens! I suspect user error on my part?

The markup methods now look like this:

private void MarkUsingParalexTokens(IDocument document, LineSegment line)
    {
        var text = document.GetText(line).ToLower();
        var input = new ANTLRStringStream(text);
        _lexer.CharStream = input;
        _tokens.TokenSource = _lexer;

        var wordStart = 0;
        var tokenCounter = 1;

        for (var i = 0; i < text.Length; i++)
        {
            var token = _tokens.LT(tokenCounter);
            if (token != null)
            {
                var c = text[i];
                if (c == ' ' || c == '\t')
                {
                    if (i > wordStart)
                        AddWord(document, line, wordStart, i - wordStart); // AddWord takes a length, not an end index
                    line.Words.Add(c == ' ' ? TextWord.Space : TextWord.Tab);
                    wordStart = i + 1;
                }
                else
                {
                    var atStartOfToken = (i == token.StartIndex);

                    if (atStartOfToken)
                    {
                        if (i > wordStart)
                            AddWord(document, line, wordStart, i - wordStart); // AddWord takes a length, not an end index

                        var tokenLength = token.StopIndex - token.StartIndex + 1;

                        AddWord(document, line, i, tokenLength, token);
                        tokenCounter++;
                        wordStart = i + tokenLength;
                        i = wordStart - 1;
                    }
                }
            }

        }

        if (wordStart < line.Length)
                AddWord(document, line, wordStart, line.Length - wordStart); // remaining length, not the full line length
    }

    void AddWord(IDocument document, LineSegment line, int startOffset, int length, IToken token = null)
    {
        if (length==0) return;

        var hasSpecialColor = token != null;
        var color = hasSpecialColor ? GetColor(token) : _highlightColors["Default"];

        line.Words.Add(new TextWord(document, line, startOffset, length, color, !hasSpecialColor));
        if (token != null) Debug.WriteLine("From typing: Text {0}, Type {1}, Color {2}", token.Text, token.Type, color);
    }

    private HighlightColor GetColor(IToken token)
    {
        var name = token.Channel;
        var groupName = "Default";

        if (name==0)
        {
            groupName = "Punctuation";
        }
        else if (name==95)
        {
            groupName = "ParalexVerbs";
        }
        else if (name==96)
        {
            groupName = "ParalexNouns";
        }
        else if (name==97)
        {
            groupName = "ParalexNumbers";
        }
        else if (name==98)
        {
            groupName = "ParalexStrings";
        }
        else if (name == 94)
        {
            groupName = "ParalexComment";
        }

        return _highlightColors[groupName];

    }
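
As a side note, the channel-to-group chain above could also be written as a table lookup with named constants instead of bare numbers. This is just a sketch: `ChannelGroups` and `GroupFor` are hypothetical names, the channel numbers are copied from the lexer members shown earlier, and it returns the group name as a string rather than a `HighlightColor`.

```csharp
using System.Collections.Generic;

public static class ChannelGroups
{
    // Channel numbers as declared in the @lexer::members block.
    public const int StringChannel = 98;
    public const int NumberChannel = 97;
    public const int NounChannel = 96;
    public const int VerbChannel = 95;
    public const int CommentChannel = 94;
    public const int DefaultChannel = 0;

    private static readonly Dictionary<int, string> Groups = new Dictionary<int, string>
    {
        { DefaultChannel, "Punctuation" },
        { VerbChannel, "ParalexVerbs" },
        { NounChannel, "ParalexNouns" },
        { NumberChannel, "ParalexNumbers" },
        { StringChannel, "ParalexStrings" },
        { CommentChannel, "ParalexComment" },
    };

    // Unknown channels fall back to the "Default" group.
    public static string GroupFor(int channel)
    {
        string name;
        return Groups.TryGetValue(channel, out name) ? name : "Default";
    }
}
```

GetColor would then reduce to something like `_highlightColors[ChannelGroups.GroupFor(token.Channel)]`.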

Thanks again for your help. I'm off to look at error recognition and markup... Regards Ian

2
Due to a bug in ANTLR 3's CommonTokenStream.SkipOffTokenChannels(int) (also present in the reference Java runtime), CommonTokenStream can currently only be used with the default token channel. (Sam Harwell)

2 Answers

3
votes

I always use a different lexer for syntax highlighting from the one used for other parsing tasks. The lexer used for syntax highlighting always meets the following:

  • No token except NEWLINE contains a \r or \n character. Instead, multiple lexer modes are used for things like block comments and any other construct which spans multiple lines (this even applies to ANTLR 3 lexers, but without support for lexer modes in ANTLR 3 itself it gets complicated fast).

  • NEWLINE is defined as the following:

    // ANTLR 3-based syntax highlighter:
    NEWLINE : ('\r' '\n'? | '\n') {skip();};
    
    // ANTLR 4-based syntax highlighter:
    NEWLINE : ('\r' '\n'? | '\n') -> skip;
    
  • No token is on the hidden channel.
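
To make the first point more concrete, here is a minimal hand-written sketch (not ANTLR, and not the lexer described above) of line-at-a-time scanning where no span crosses a line break and the "inside a block comment" mode is carried from one line to the next. The names `LineScanner`, `Span` and `SpanKind` are hypothetical, and the sketch deliberately ignores string literals that might contain `//` or `/*`.

```csharp
using System;
using System.Collections.Generic;

public enum SpanKind { Code, Comment }

public struct Span
{
    public Span(int start, int length, SpanKind kind) { Start = start; Length = length; Kind = kind; }
    public int Start { get; }
    public int Length { get; }
    public SpanKind Kind { get; }
}

public static class LineScanner
{
    // Scans one line (without '\r' or '\n'). inBlockComment is the mode left over
    // from the previous line and is updated for the next line.
    public static List<Span> Scan(string line, ref bool inBlockComment)
    {
        var spans = new List<Span>();
        var i = 0;
        while (i < line.Length)
        {
            var start = i;
            if (inBlockComment)
            {
                // Continue the comment until "*/" or the end of the line.
                var close = line.IndexOf("*/", i, StringComparison.Ordinal);
                i = close < 0 ? line.Length : close + 2;
                inBlockComment = close < 0;
                spans.Add(new Span(start, i - start, SpanKind.Comment));
                continue;
            }

            var open = line.IndexOf("/*", i, StringComparison.Ordinal);
            var lineComment = line.IndexOf("//", i, StringComparison.Ordinal);
            var next = open;
            if (lineComment >= 0 && (next < 0 || lineComment < next)) next = lineComment;

            if (next < 0)
            {
                // No comment on the rest of the line.
                spans.Add(new Span(i, line.Length - i, SpanKind.Code));
                i = line.Length;
            }
            else if (next > i)
            {
                // Plain code up to the next comment start.
                spans.Add(new Span(i, next - i, SpanKind.Code));
                i = next;
            }
            else if (next == lineComment)
            {
                // "//" comment runs to the end of the line.
                spans.Add(new Span(i, line.Length - i, SpanKind.Comment));
                i = line.Length;
            }
            else
            {
                // "/*" comment: look for "*/" after the opener.
                var close = line.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = close < 0 ? line.Length : close + 2;
                inBlockComment = close < 0;
                spans.Add(new Span(start, i - start, SpanKind.Comment));
            }
        }
        return spans;
    }
}
```

An editor would store the resulting mode per line, so that after an edit only the lines from the change onward need rescanning, and only until the stored mode stops changing.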

If you don't want to go this route, you could remove the actions {$channel=Hidden;} from your Comment rule, and instead derive a new class from your base lexer. In the derived class, override Emit(). Use the base implementation for syntax highlighting, and the derived implementation for passing to a parser. This is easier in some cases, but for languages with multi-line strings or comments introduces a substantial performance limitation that we find unacceptable for any of our products.

public override IToken Emit()
{
    if (state.type == Comment)
        state.channel = Hidden;

    return base.Emit();
}
0
votes

I use C++ (the Qt/Scintilla library), but either way I would recommend using a different lexer for syntax highlighting. My highlighting lexer differs from the parsing one:

  • no need for context-sensitive lexing ("X" is a keyword if and only if it is followed by "Y"; otherwise it is an identifier)

  • the highlighting lexer must never fail

  • I want the built-in functions to be highlighted (this is not needed for parsing)

The Gui Lexer grammar contains additional rules (at the end).

QUOTED_STRING_FRAGMENT
    :    '"' (~('"') | '\"')+ EOF 
    ;

// Last resort rule matches any character. This lexer should never fail.
TOKEN_FAILURE : . ;

The rule TOKEN_FAILURE matches any "invalid" character in the user's input so that it can be displayed with a red background. Otherwise this character would be skipped and the highlighting would be shifted.

QUOTED_STRING_FRAGMENT handles the situation when user enters the 1st quote and a string is not finished yet.
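
The same "never fail" idea can be sketched outside ANTLR as a tiny hand-written scanner in C# (the language used in the question). Everything here is hypothetical and simplified: `SafeLexer`, `Kind` and the token categories are made up for illustration. The point is only that the last branch consumes exactly one character, so the produced spans always cover the whole input and offsets never shift.

```csharp
using System;
using System.Collections.Generic;

public enum Kind { Identifier, Number, Whitespace, Failure }

public static class SafeLexer
{
    // Splits text into spans that together cover every character of the input.
    public static List<(Kind Kind, int Start, int Length)> Lex(string text)
    {
        var result = new List<(Kind Kind, int Start, int Length)>();
        var i = 0;
        while (i < text.Length)
        {
            var start = i;
            Kind kind;
            if (char.IsLetter(text[i]))
            {
                while (i < text.Length && char.IsLetterOrDigit(text[i])) i++;
                kind = Kind.Identifier;
            }
            else if (char.IsDigit(text[i]))
            {
                while (i < text.Length && char.IsDigit(text[i])) i++;
                kind = Kind.Number;
            }
            else if (char.IsWhiteSpace(text[i]))
            {
                while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
                kind = Kind.Whitespace;
            }
            else
            {
                // Last resort: consume one unknown character instead of failing.
                i++;
                kind = Kind.Failure;
            }
            result.Add((kind, start, i - start));
        }
        return result;
    }
}
```

A highlighter would then paint `Kind.Failure` spans with the error color and everything else with its normal color.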