3
votes

I am very new to ANTLR and am trying to understand how the Lexer and Parser rules work. I'm experiencing issues with a grammar I've written that seem to be related to lexer tokens with multiple characters being seen as "matches" even when only the first few characters actually match. To demonstrate this, I have written a simple ANTLR 3 Grammar:

grammar test;
options {
    k=3;
}

@lexer::header { package test;}
@header {package test;}

sentence    :   (CHARACTER)*;

CHARACTER   :   'a'..'z'|' ';
SPECIAL     :   'special';

I'm using AntlrWorks to parse the following test input:

apple basic say sponsor speeds speckled specific wonder

The output I get is:

apple basic say nsor ds led ic wonder

It seems to me that the LEXER is using k=1 and therefore matching my SPECIAL token with anything that includes the two letters 'sp'. Once it encounters the letters 'sp', it then matches sucessive characters within the SPECIAL literal until the actual input fails to match the expected token - at which point it throws an error (consuming that character) and then continues with the rest of the sentence. Each error is of the form:

  line 1:18 mismatched chracter 'o' expecting 'e'

However, this isn't the behaviour I'm trying to create. I wish to create a lexer token that matches the keyword ('special') - for use in other parser rules not included in this test example. However, I don't want other rules/input that just happens to include the same initial characters to be affected

To summarize:

  1. How do I actually set antlr 3 options (such as k=2 or k=3 etc)? It seems to me, at least, that the options I'm trying to use here aren't being set.
  2. Is there a better way to create parser or lexer rules to match a particular keyword in my input, without affecting processing of other parts of the input that don't contain a full match?
1
when I run your samples in ANTLRWorks 1.4.2 it parses fine, what output are you referring to?Bas Bossink
That's interesting, @Bas. I'm also using ANTLRWorks 1.4.2 (normally run from within IntelliJ 10.0.3). In debug mode I have an "Output" tab which shows the errors. I can also inspect the parse tree (Parse Tree tab) to see that the "spe.." bits are being dropped. Similarly, the "Stack Tab" shows the issue on my machine. Does yours not show any problems? Can you inspect the ParseTree and see the full sentence, including the bits that start with letters "spec..."?Glennn
I'm using the Interpreter tab it generates a parse tree when I run the sentence rule on your example input.Bas Bossink
@Bas, yes, ANTLRWorks will show a parse tree, but it won't be the one Glennn is looking for. I'm pretty sure you also get error messages on the console in ANTLRWorks.Bart Kiers

1 Answers

4
votes

The k in the options { ... } section defines the look ahead of the parser, not the lexer.

Note that the grammar

CHARACTER   :   'a'..'z'|' ';
SPECIAL     :   'special';

is ambiguous: your 'special' could also be considered as 7 'a'..'z''s. Normally, it'd be lexed as follows:

grammar Test;

sentence : (special | word | space)+ EOF;
special  : SPECIAL;
word     : WORD;
space    : SPACE;

SPECIAL  : 'special';
WORD     : 'a'..'z'+;
SPACE    : ' ';

which will parse the input:

specia special specials

as follows:

enter image description here

I.e. it gets (more or less) tokenized as a combination of LL(1) and "longest-matched". Sorry, I realize that's a bit vague, but the Definitive ANTLR Reference does not clarify this exactly (at least, I can't find it...). But I realize that this might not be what you're looking for.

AFAIK, the only way to produce single char-tokens and define keywords that are made up from these single char-tokens, is done by merging these two tokens in a single rule, and use predicates and manual look-ahead to see if they conform to a key word, and if not, change the type of the token in a "fall through" sub rule. A demo:

grammar test;

tokens {
  LETTER;
}

@lexer::members {
  // manual look ahead
  private boolean ahead(String text) {
    for(int i = 0; i < text.length(); i++) {
      if(input.LA(i+1) != text.charAt(i)) {
        return false;
      }
    }
    return true;
  }
}

sentence
  :  (t=. {System.out.printf("\%-7s :: '\%s'\n", tokenNames[$t.type], $t.text);})+ EOF
  ;

SPECIAL 
  :  {ahead("special")}?=> 'special'
  |  {ahead("keyword")}?=> 'keyword'
  |  'a'..'z' {$type = LETTER;} // Last option and no keyword is found: 
                                // change the type of this token
  ;

SPACE
  :  ' '
  ;

The parser generated from the above grammar can be tested with the class:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("apple basic special speckled keyword keywor");
        testLexer lexer = new testLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        testParser parser = new testParser(tokens);
        parser.sentence();
    }
}

As you can see, when parsing the input:

apple basic special speckled keyword keywor

the following output is generated:

LETTER  :: 'a'
LETTER  :: 'p'
LETTER  :: 'p'
LETTER  :: 'l'
LETTER  :: 'e'
SPACE   :: ' '
LETTER  :: 'b'
LETTER  :: 'a'
LETTER  :: 's'
LETTER  :: 'i'
LETTER  :: 'c'
SPACE   :: ' '
SPECIAL :: 'special'
SPACE   :: ' '
LETTER  :: 's'
LETTER  :: 'p'
LETTER  :: 'e'
LETTER  :: 'c'
LETTER  :: 'k'
LETTER  :: 'l'
LETTER  :: 'e'
LETTER  :: 'd'
SPACE   :: ' '
SPECIAL :: 'keyword'
SPACE   :: ' '
LETTER  :: 'k'
LETTER  :: 'e'
LETTER  :: 'y'
LETTER  :: 'w'
LETTER  :: 'o'
LETTER  :: 'r'

See the Q&A What is a 'semantic predicate' in ANTLR? to learn more about predicates in ANTLR.