I have to parse, with ANTLR4, a text file made up of many Data Blocks. Each Data Block has a Data Block Header (one line) followed by one or more DataRows (1..* lines).

The Data Block Header always starts with '1' at the first position of the line, followed by several alphanumeric fields.

A DataRow is also composed of alphanumeric fields (dataFields). The character '1' can be the first dataField, but it is never located at the first position of the line.

This is a sample of the input to parse:

1   DataHeaderField1 datafield2 DataBlock1
    DB1_Row1_F1 DB1_Row1_F2    DB1_Row1_F3  DataBlock1
    DB1_Row2_F1 DB1_Row2_F2    DB1_Row2_F3  DataBlock1

1   DataHeaderField1 datafield2 DataBlock2
    DB2_Row1_F1 DB2_Row1_F2    DB2_Row1_F3  DataBlock2
    DB2_Row2_F1 DB2_Row2_F2    DB2_Row2_F3  DataBlock2
    DB2_Row3_F1 DB2_Row3_F2    DB2_Row3_F3  DataBlock2

....

The grammar I tried is:

grammar ReadDataBlocks;
start_parsing: dataBlock+ EOF;
dataBlock: commonHeader  row+;
commonHeader: ONE_AT_FIRST_POS APLHANUMERIC* NL ;
row: APLHANUMERIC+ NL;

ONE_AT_FIRST_POS:   '1' {getCharPositionInLine() == 1}?;

APLHANUMERIC : (LETTER
                |
                DIGIT)+;
DIGIT: [0-9];
LETTER: [a-zA-Z];
NL: '\r'? '\n';
ESPACES : [ \t]+ -> skip;

To parse the file, I restricted the token in the lexer with a predicate, as shown in my grammar, and declared the ONE_AT_FIRST_POS token before the DIGIT token, so that a '1' detected at the first position is never tokenized as DIGIT.

The problem is that when the parser encounters a '1' in any other position, it is still identified as ONE_AT_FIRST_POS, producing the following message:

[Image: output from the IntelliJ IDEA ANTLR plugin]


1 Answer

After running:

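import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.Token;

// Dumps the token stream produced by the lexer generated from the ReadDataBlocks grammar.
public class Main {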

    public static void main(String[] args) {

        String source = "1   headerData1 headData2 HeadDataN\n    row1Data Row2Data 1 333 rowNData";
        Lexer lexer = new ReadDataBlocksLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();

        for (Token t : tokens.getTokens()) {
            System.out.printf("%-20s `%s`\n", ReadDataBlocksLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}

I get the following output:

ONE_AT_FIRST_POS     `1`
APLHANUMERIC         `headerData1`
APLHANUMERIC         `headData2`
APLHANUMERIC         `HeadDataN`
NL                   `
`
APLHANUMERIC         `row1Data`
APLHANUMERIC         `Row2Data`
APLHANUMERIC         `1`
APLHANUMERIC         `333`
APLHANUMERIC         `rowNData`
EOF                  `<EOF>`

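So with the grammar as posted, the '1' in the middle of the row is tokenized as APLHANUMERIC, not as ONE_AT_FIRST_POS. I think you forgot to regenerate the parser classes after adding the predicate.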
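
If it helps to reason about the expected result without the generated classes, here is a minimal hand-written sketch in plain Java of the rule the predicate is meant to enforce. It is not ANTLR-generated code: the class FirstColumnOneSketch and its tokenize method are hypothetical names made up for illustration, and the logic is only my approximation of the lexer (a maximal alphanumeric run becomes ONE_AT_FIRST_POS only when it is exactly "1" and starts at column 0, which corresponds to getCharPositionInLine() being 1 right after the '1' is consumed).

```java
import java.util.ArrayList;
import java.util.List;

public class FirstColumnOneSketch {

    // Mirrors [a-zA-Z0-9] from the LETTER and DIGIT rules (ASCII only).
    static boolean isAlphanumeric(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Returns one "TYPE:text" entry per token (plain "NL" for line breaks).
    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int column = 0; // 0-based column of the character at index i
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\n') {
                tokens.add("NL");
                i++;
                column = 0;
            } else if (c == '\r' && i + 1 < source.length() && source.charAt(i + 1) == '\n') {
                tokens.add("NL");
                i += 2;
                column = 0;
            } else if (c == ' ' || c == '\t') {
                // Whitespace is skipped, but it still advances the column.
                i++;
                column++;
            } else if (isAlphanumeric(c)) {
                int start = i;
                int startColumn = column;
                while (i < source.length() && isAlphanumeric(source.charAt(i))) {
                    i++;
                    column++;
                }
                String text = source.substring(start, i);
                // Only a lone "1" that begins the line counts as the header marker.
                boolean headerMarker = text.equals("1") && startColumn == 0;
                tokens.add((headerMarker ? "ONE_AT_FIRST_POS" : "APLHANUMERIC") + ":" + text);
            } else {
                throw new IllegalArgumentException("Unexpected character at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "1   headerData1 headData2 HeadDataN\n    row1Data Row2Data 1 333 rowNData";
        for (String token : tokenize(source)) {
            System.out.println(token);
        }
    }
}
```

For the sample string used above, only the first '1' is labelled ONE_AT_FIRST_POS in this sketch; the '1' inside the row comes out as APLHANUMERIC, which matches the token dump from the regenerated lexer.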