I'm trying to write an ANTLR parser for text formatted like this:

RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA],
RP   PROTEIN SEQUENCE OF 1-22; 2-17;
RP   240-256; 318-339 AND 381-390, AND CHARACTERIZATION.

Every line has a leading 'RP ' marker indicating what the line is for, and the last line ends with a "." to mark the end of this group of lines. The text itself can be anything. All I need in the end is the text.

I wrote an ANTLR grammar for this purpose:

grammar RefLine;

rp_line: RP_HEADER RP_TEXT;

RP_HEADER : 'RP   '            -> pushMode(RP_FREE_TEXT_MODE);

mode RP_FREE_TEXT_MODE;
RP_HEADER_SKIP: '\nRP   '      -> skip;
RP_TEXT: .+;
DOT_NEWLINE: '.\n'             -> popMode;

The idea is that on seeing the first RP_HEADER, the lexer switches to RP_FREE_TEXT_MODE and skips any RP_HEADER between the lines. On seeing DOT_NEWLINE, it returns to the default mode.

This grammar, however, doesn't compile with ANTLR 4.1; it produces these messages:

[ERROR] Message{errorType=MODE_NOT_IN_LEXER, args=[RP_FREE_TEXT_MODE, org.antlr.v4.tool.Grammar@5c0662], e=null, fileName='RefLine.g4', line=7, charPosition=5}
[WARNING] Message{errorType=IMPLICIT_TOKEN_DEFINITION, args=[RP_TEXT], e=null, fileName='RefLine.g4', line=3, charPosition=19}

I don't understand why this error is produced. Can anyone explain the correct way to use lexer modes in ANTLR? Also, are tokens defined in a mode not available to parser rules?

EDIT:

As @auselen suggested, I put the lexer grammar in a separate file, RefLineLex.g4:

lexer grammar RefLineLex;

RP_HEADER : 'RP   '            -> pushMode(RP_FREE_TEXT_MODE);

mode RP_FREE_TEXT_MODE;
RP_HEADER_SKIP: '\nRP   '      -> skip;
RP_TEXT: .+;
DOT_NEWLINE: '.\n'             -> popMode;

And in another combined grammar, RefLine.g4, I have:

grammar RefLine;
import RefLineLex;

rp_line: RP_HEADER RP_TEXT ;

Now ANTLR compiles the files, but the generated RefLineLexer.java contains:

private void RP_HEADER_action(RuleContext _localctx, int actionIndex) {
        switch (actionIndex) {
        case 0: pushMode(RP_FREE_TEXT_MODE);  break;
        }
    }

The constant RP_FREE_TEXT_MODE is not defined anywhere in RefLineLexer.java. Am I still missing something?

If all you want to do is extract the concatenated text, ANTLR may be overkill. Maybe a simple line-by-line Java program that looks at the first token to select lines and appends text to a StringBuilder until it finds a line ending with a '.'? - Jim Garrison
@JimGarrison, it is just a small part of a big flat-file parser. - Wudong
Ask a new question rather than changing an existing one; otherwise this could go on forever. - auselen

1 Answer

Lexer modes are only available in lexer grammars, not in combined grammars (lexer + parser). See the Lexer Rules documentation (which is rather sparse) and the XML parser implementation on GitHub for an example.

The errorType=MODE_NOT_IN_LEXER message in the error output says exactly this :)
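
For illustration, here is a minimal sketch of how the two grammars could be split. It is not tested against the asker's full input format, and several things in it are assumptions rather than part of the question: the file names RefLineLexer.g4 and RefLineParser.g4, the NEWLINE rule, the use of RP_TEXT+ in the parser rule, and the rewrite of RP_TEXT (the original `.+` is greedy and would swallow the rest of the input, including the terminating '.'). The parser grammar refers to the lexer's tokens through the tokenVocab option instead of import, and tokens defined inside a mode are available to parser rules this way.

```antlr
// ---- RefLineLexer.g4 (assumed file name) ----
lexer grammar RefLineLexer;

// The first header of a block switches the lexer into free-text mode.
RP_HEADER : 'RP   ' -> pushMode(RP_FREE_TEXT_MODE) ;

// Assumption: blank lines between blocks carry no meaning.
NEWLINE   : '\r'? '\n' -> skip ;

mode RP_FREE_TEXT_MODE;

// A continuation line: drop the line break together with its 'RP   ' prefix.
RP_HEADER_SKIP : '\r'? '\n' 'RP   ' -> skip ;

// A '.' directly followed by a line break ends the block.
// It wins over RP_TEXT because the lexer prefers the longest match.
DOT_NEWLINE    : '.' '\r'? '\n' -> popMode ;

// Everything else is text. A lone '.' inside a line is still text.
RP_TEXT        : ~[\r\n.]+ | '.' ;


// ---- RefLineParser.g4 (assumed file name) ----
parser grammar RefLineParser;

// Take token types from the lexer grammar above.
options { tokenVocab = RefLineLexer; }

// The text may arrive as several RP_TEXT tokens, one per line or fragment.
rp_line : RP_HEADER RP_TEXT+ DOT_NEWLINE ;
```

With this split, the text of a block would be the concatenation of the RP_TEXT tokens under rp_line. Note that a line ending in '.' in the middle of a block would end the block early, and a final '.' with no trailing line break would not pop the mode; both cases would need extra rules if they occur in the real data.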