I am trying to write a simple ANTLR4 grammar for parsing SRT subtitle files. I thought it would be an easy, introductory task, but I guess I am missing something. But first things first, here is the grammar:

grammar Srt;

file    :   subtitle (NL NL subtitle)* EOF;

subtitle:   SUBNO NL
            TSTAMP ' --> ' TSTAMP NL
            LINE (NL LINE)*;

TSTAMP  :   I99 ':' I59 ':' I59 ',' I999;
SUBNO   :   D09+;
NL      :   '\r'? '\n';
LINE    :   ~('\r'|'\n')+;

fragment I999   :   D09 D09 D09;
fragment I99    :   D09 D09;
fragment I59    :   D05 D09;
fragment D09    :   [0-9];
fragment D05    :   [0-5];

And here is the beginning of an SRT file, where the problem starts:

1
00:00:20,000 --> 00:00:26,000

The error I get is:

line 2:0 mismatched input '00:00:20,000 --> 00:00:26,000' expecting TSTAMP

So it looks like the second line was matched by the lexer rule LINE, since that yields the longest possible token. What I expected was a match for TSTAMP, which is why I defined it before LINE in the grammar. My ANTLR4 knowledge is too weak at this point to tweak the grammar so that the lexer only tries the tokens that are valid at the current position in the parser rule. What I want is to match TSTAMP rather than LINE, since TSTAMP is the input the parser actually expects. Maybe I could work around it with lexer modes, but I can hardly believe there isn't a simpler way. Is there?


As CoronA suggested, the trick was to move the decision about what counts as a line from the lexer into the parser. I modified the grammar a bit more, and now it parses subtitles smoothly:

grammar Srt;

file    :   subtitle (NL NL subtitle)* EOF;

subtitle:   SUBNO NL
            TSTAMP ' --> ' TSTAMP NL
            lines;

lines   :   line (NL line)*;
line    :   (LINECHAR | SUBNO | TSTAMP)*;

TSTAMP  :   I99 ':' I59 ':' I59 ',' I999;
SUBNO   :   D09+;
NL      :   '\r'? '\n';
LINECHAR:   ~[\r\n];

fragment I999   :   D09 D09 D09?;
fragment I99    :   D09 D09;
fragment I59    :   D05 D09;
fragment D09    :   [0-9];
fragment D05    :   [0-5];

1 Answer

Your definition of the token LINE subsumes everything:

LINE    :   ~('\r'|'\n')+;

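Every TSTAMP is also a valid LINE, but LINE can match longer lexemes, and it does, as you can see. ANTLR's lexer prefers the longest match; the order of the rules only breaks ties between matches of equal length. The lexer also runs independently of the parser, so it does not know that the parser expects a TSTAMP at that point.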
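
To make the longest-match behaviour concrete, here is a minimal Python sketch that imitates it with regular expressions. It is only an approximation of the lexer rules from the question, not ANTLR itself; the names `RULES`, `next_token`, `tokenize` and `ARROW` (standing in for the implicit `' --> '` token) are made up for this illustration.

```python
import re

# Regex stand-ins for the lexer rules of the original grammar, in grammar order.
# ARROW is a hypothetical name for the implicit token created by ' --> '.
RULES = [
    ("ARROW", re.compile(r" --> ")),
    ("TSTAMP", re.compile(r"[0-9]{2}:[0-5][0-9]:[0-5][0-9],[0-9]{3}")),
    ("SUBNO", re.compile(r"[0-9]+")),
    ("NL", re.compile(r"\r?\n")),
    ("LINE", re.compile(r"[^\r\n]+")),
]

def next_token(text, pos):
    """Pick the longest match at pos; on equal length the earlier rule wins."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so ties keep
        # the rule that was listed first.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens

sample = "1\n00:00:20,000 --> 00:00:26,000\n"
tokens = tokenize(sample)
for name, lexeme in tokens:
    print(name, repr(lexeme))
```

In this sketch the first line still becomes a SUBNO, because SUBNO and LINE match the same single character and SUBNO is listed first. On the second line TSTAMP matches only 12 characters while LINE matches all 29, so LINE wins, which corresponds to the token shown in the error message.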

To make your grammar work, move the decision about what a line is from the lexer into the parser:

subtitle:   SUBNO NL
            TSTAMP ' --> ' TSTAMP NL
            line*;

line:   (LINECHAR | TSTAMP | SUBNO)* NL?;

...

LINECHAR    :   ~('\r'|'\n' ) ; //remove the '+'

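You can see that a line may contain any LINECHAR, but also TSTAMPs and SUBNOs, because the lexer still emits those tokens whenever a timestamp or a run of digits appears inside the subtitle text.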
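
The effect of dropping the `+` can be checked with a small Python sketch that imitates the longest-match rule using regular expressions. This is an approximation for illustration only, not ANTLR's actual implementation; `RULES`, `tokenize` and the `ARROW` name for the implicit `' --> '` token are assumptions of this sketch, and the third input line is an invented subtitle text.

```python
import re

# Regex stand-ins for the revised lexer rules, in grammar order.
# LINECHAR now matches exactly one character, so it can no longer
# swallow a whole line.
RULES = [
    ("ARROW", re.compile(r" --> ")),  # hypothetical name for the ' --> ' literal
    ("TSTAMP", re.compile(r"[0-9]{2}:[0-5][0-9]:[0-5][0-9],[0-9]{3}")),
    ("SUBNO", re.compile(r"[0-9]+")),
    ("NL", re.compile(r"\r?\n")),
    ("LINECHAR", re.compile(r"[^\r\n]")),
]

def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

# Invented sample: a subtitle whose text happens to contain a number.
tokens = tokenize("1\n00:00:20,000 --> 00:00:26,000\nHi 42\n")
names = [name for name, _ in tokens]
print(names)
```

Here the timestamp line comes out as TSTAMP, ARROW, TSTAMP, since each of those matches is longer than a single LINECHAR. The text line `Hi 42` comes out as three LINECHAR tokens followed by a SUBNO for `42`, which is why the parser rule `line` has to accept SUBNO and TSTAMP alongside LINECHAR.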