0
votes

I am a an Antlr4 newbie and have problems with a relatively simple grammar. The grammar is given at the bottom at the end. (This is a fragment from a grammar for parsing description of biological sequence variants).

I am trying to parse the string "p.A3L" in the following unit test.

@Test
public void testProteinSubtitutionWithoutRef() {
    ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
    HGVSLexer l = new HGVSLexer(inputStream);
    HGVSParser p = new HGVSParser(new CommonTokenStream(l));
    p.setTrace(true);
    p.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
        }
    });
    p.hgvs();
}

The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume that this is related to lexing, i.e. splitting "A3L" into the three tokens A, 3, and L, such that the parser can then generate the corresponding syntax subtree containing the three terminals from it.

What is going wrong here and where can I learn how to fix this?

The grammar

grammar HGVS;

hgvs: protein_var
    ;

// Basix lexemes

AA: AA1
  | AA3
  | 'X';

AA1: 'A'
   | 'R'
   | 'N'
   | 'D'
   | 'C'
   | 'Q'
   | 'E'
   | 'G'
   | 'H'
   | 'I'
   | 'L'
   | 'K'
   | 'M'
   | 'F'
   | 'P'
   | 'S'
   | 'T'
   | 'W'
   | 'Y'
   | 'V';

AA3: 'Ala'
   | 'Arg'
   | 'Asn'
   | 'Asp'
   | 'Cys'
   | 'Gln'
   | 'Glu'
   | 'Gly'
   | 'His'
   | 'Ile'
   | 'Leu'
   | 'Lys'
   | 'Met'
   | 'Phe'
   | 'Pro'
   | 'Ser'
   | 'Thr'
   | 'Trp'
   | 'Tyr'
   | 'Val';

NUMBER: [0-9]+;

NAME: [a-zA-Z0-9_]+;

// Top-level Rule

/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
           ;
1

1 Answers

0
votes

There are two problems:

  • Define the rule for protein_var ahead of the lexer rules (should work now to, but is not easy to read because the other parser rule is ahead).
  • Remove the rule for NAME. A3L is not (as you probably expected) AA NUMBER AA but NAME <= ANTLR always prefers the longest matching lexer rule

The resulting grammar should look like:

grammar HGVS;

hgvs
    : protein_var
    ;

protein_var
    : 'p.' AA NUMBER AA
    ;

AA: ...;

AA3: ...;

AA1: ...;

NUMBER: [0-9]+;

If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAMEs and AA do not have in common or by using lexer modes).