The second alternative ((1-9)(0-9)) of the following parser rule results in two nodes in the abstract syntax tree.

oneToHundred
  : ('1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')
  | ('1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')
  | '100'
  ;

(Side note: "lexing" the numbers into Digit tokens isn't an option for me, since sometimes a sub-range of 0-9 such as 2-4 can represent something very different from a digit, which, by the way, I can't influence.)

So for 15 I get two nodes, one and five, instead of fifteen, but I would like to get this as one number represented by one node.

I cannot do this in the lexer at the token level, because depending on the context, 15 can mean two very different things: either a "one-symbol and a five-symbol" (which definitely should be two nodes) or "fifteen". According to this post, context-sensitivity should be left to the parser.


(Edit for clarification:)

Example for context-sensitivity:

The input is split into parts separated by semicolons.

Input:
11;2102;34%;P11o

This would be split into four parts:
11: not a number, but one '1'-symbol followed by another '1'-symbol
2102: not a number, but '2'-symbol '1'-symbol '0'-symbol '2'-symbol
34%: here 34 is the number thirty-four
P11o: 'P'-symbol '1'-symbol '1'-symbol 'o'-symbol

Of these four blocks, 34% will be recognized as a percent-block by a parser rule and the others as symbol-blocks. So the AST should look something like this:

SYMBOL
  1
  1
SYMBOL
  2
  1
  0
  2
PERCENT
  34
SYMBOL
  P
  1
  1
  o

The target is C#:

options {
  language=CSharp3;
  output=AST;
}

I'm an ANTLR noob, so is there a good way to merge these two nodes in the parser, or am I better off adding an imaginary token and concatenating the two digits "manually" in C# after parsing?

1 Answer

Your parser rule:

oneToHundred
 : ('1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')
 | ('1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')
 | '100'
 ;

implicitly creates the following tokens behind the scenes:

D_1_9 : ('1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9');
D_0_9 : ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9');
D_100 : '100';

(The rules do not get those names, but tokens matching those contents are created.)

So, if your lexer gets the input "11", two D_1_9 tokens are created, and the 2nd alternative of the oneToHundred rule cannot be matched (this alternative needs the two tokens D_1_9 D_0_9).

You must realize that the lexer operates independently from the parser. It doesn't matter what type of token the parser "asks" of the lexer: the lexer has its own rule priorities, which cause the characters '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' never to be matched by the D_0_9 rule (because it comes after the D_1_9 rule).
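
The effect of that rule priority can be imitated in a few lines of plain Java. This is only a toy model and not ANTLR's actual lexing algorithm: it assumes every rule matches exactly one character and that the first rule in declaration order wins. The class and method names (LexerPriorityDemo, tokenTypeFor, tokenize, matchesSecondAlternative) are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LexerPriorityDemo {
    // Toy model (assumption): rules are tried in declaration order and the
    // first one that matches the character wins.
    static String tokenTypeFor(char c) {
        if (c >= '1' && c <= '9') return "D_1_9"; // declared first
        if (c >= '0' && c <= '9') return "D_0_9"; // only ever reached for '0'
        throw new IllegalArgumentException("no rule matches '" + c + "'");
    }

    // Turns the input into a list of token type names, one per character.
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            types.add(tokenTypeFor(c));
        }
        return types;
    }

    // The second alternative of oneToHundred needs exactly: D_1_9 D_0_9
    static boolean matchesSecondAlternative(String input) {
        return tokenize(input).equals(Arrays.asList("D_1_9", "D_0_9"));
    }

    public static void main(String[] args) {
        // prints: [D_1_9, D_1_9] false
        System.out.println(tokenize("11") + " " + matchesSecondAlternative("11"));
        // prints: [D_1_9, D_0_9] true
        System.out.println(tokenize("10") + " " + matchesSecondAlternative("10"));
    }
}
```

In this model only "10", "20", ... "90" produce the token sequence the second alternative expects; every other two-digit input yields two D_1_9 tokens, no matter which alternative the parser is trying.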


EDIT

Let's say your input, 11;2102;34%;P11o, consists of four units, each made up of atoms (where an atom is either a letter or a digit) and possibly ending with a '%':

unit
  :  atoms '%'?
  ;

If a unit ends with a '%', you simply use a rewrite rule to create a tree with PERCENT as root; otherwise, you create a tree with SYMBOL as root:

unit
  :  (atoms -> ^(/* SYMBOL */)) ('%' -> ^( /* PERCENT */))?
  ;

A working demo:

grammar T;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  ROOT;
  SYMBOL;
  PERCENT;
  NUMBER;
}

parse
  :  unit (';' unit)* EOF -> ^(ROOT unit+)
  ;

unit
  :  (atoms -> ^(SYMBOL atoms)) 
     ('%' -> ^(PERCENT {new CommonTree(new CommonToken(NUMBER, $atoms.text))}))?
  ;

atoms
  :  atom+
  ;

atom
  :  Letter
  |  Digit
  ;

Digit  : '0'..'9';
Letter : 'a'..'z' | 'A'..'Z';

You can test the parser using the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    TLexer lexer = new TLexer(new ANTLRStringStream("11;2102;34%;P11o"));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

which will produce DOT-output that corresponds to the following AST:

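[Image: the resulting AST, with ROOT at the top and four subtrees: SYMBOL (1 1), SYMBOL (2 1 0 2), PERCENT (34) and SYMBOL (P 1 1 o)]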

In the image above, all leaves are of type Letter or Digit, except "34", whose type is NUMBER.
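
For readers who want to check the expected tree shape without generating a parser, here is a rough plain-Java sketch of the same decision: a unit ending in '%' becomes a PERCENT node with a single concatenated number child, and anything else becomes a SYMBOL node with one child per character. It does not use ANTLR at all; Node, buildUnit and buildRoot are hypothetical names for this illustration, and the input is assumed to be well-formed (no validation that the part before '%' consists of digits).

```java
import java.util.ArrayList;
import java.util.List;

public class UnitTreeSketch {
    // Minimal stand-in for a tree node; this is not an ANTLR class.
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) { this.text = text; }

        Node add(Node child) { children.add(child); return this; }

        // LISP-style rendering, e.g. (PERCENT 34)
        String toStringTree() {
            if (children.isEmpty()) return text;
            StringBuilder sb = new StringBuilder("(").append(text);
            for (Node c : children) sb.append(' ').append(c.toStringTree());
            return sb.append(')').toString();
        }
    }

    // One unit: PERCENT with a single merged child, or SYMBOL with one child per atom.
    static Node buildUnit(String unit) {
        if (unit.endsWith("%")) {
            String number = unit.substring(0, unit.length() - 1);
            return new Node("PERCENT").add(new Node(number));
        }
        Node symbol = new Node("SYMBOL");
        for (char c : unit.toCharArray()) symbol.add(new Node(String.valueOf(c)));
        return symbol;
    }

    // Whole input: units separated by ';' under a common ROOT node.
    static Node buildRoot(String input) {
        Node root = new Node("ROOT");
        for (String unit : input.split(";")) root.add(buildUnit(unit));
        return root;
    }

    public static void main(String[] args) {
        // prints: (ROOT (SYMBOL 1 1) (SYMBOL 2 1 0 2) (PERCENT 34) (SYMBOL P 1 1 o))
        System.out.println(buildRoot("11;2102;34%;P11o").toStringTree());
    }
}
```

The same concatenation could be done as a post-processing step over the real AST, which is the "manual" approach mentioned in the question; the rewrite rule in the grammar above simply performs it at parse time.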