2
votes

I am new to ANTLR4 and am using it to parse a text file. Here is part of the file:

abcdef
//emptyline
abcdef

Read as a string from the file stream, it looks like this:

abcdef\r\n\r\nabcdef\r\n

ANTLR4 offers the "skip" command to discard input such as spaces, tabs, and newline characters matched by a regular expression while parsing, e.g.:

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

My problem is that I want to skip empty lines only, not every single "\r\n". In other words, when two or more "\r\n" appear in a row, I want to keep the first and skip the second and any following ones. How should I write the regular expression? Thank you.

grammar INIGrammar_1;
init: (section|NEWLINE)+ ;

section:  '[' phase_name ':' v ']' (contents)+ 
            | '[' phase_name ']' (contents)+ ; 
//
//
phase_name : STRING
            |MTT
            |MPI_GET
            |MPI_INSTALL
            |MPI_DETAILS
            |TEST_GET
            |TEST_BUILD
            |TEST_RUN
            |REPORTER
            ; 
v  : STRING ;      

contents: kvpairs 
          | include_section_pairs
          | if_statement
          | NEWLINE
          | EOT
          ;

keylhs : STRING
        ;
valuerhs : STRING 
          |multiline_valuerhs
          |kvpairs
          |url
          ;
kvpairs: keylhs '=' valuerhs NEWLINE
        ;
include_section_pairs: INCLUDE_SECTION '=' STRING
                    ;
if_statement: IF if_statement_condition THEN NEWLINE (ELSEIF if_statement_condition THEN NEWLINE)*? STRING NEWLINE IFEND NEWLINE
            ;
if_statement_condition:STRING '=' STRING ';'//here, semicolon has problem, either I use ';' or SEMICOLON
                        ;
multiline_valuerhs:STRING (',' (' ')*? ( '\\' (' ')*? NEWLINE)? STRING)+ 
                    ;
url:(' ')*?'http'':''//''www.';//ignore this, not finished.
IF: 'if';
ELSEIF:'elif';
IFEND:'fi';
THEN: 'then';
SEMICOLON: ';';
STRING : [a-z|A-Z|0-9|''| |.|\-|_|(|)|#|&|""|/|@|<|>|$]+ ;

//Keywords
MTT: 'MTT';
MPI_GET: 'MPI get';
MPI_INSTALL:'MPI install';
MPI_DETAILS:'MPI Details';
TEST_GET:'Test get';
TEST_BUILD: 'Test build';
TEST_RUN: 'Test run';
REPORTER: 'Reporter';
INCLUDE_SECTION: 'include_section';
//INCLUDE_SECTION_VALUE:STRING;
EOT:'EOT';

NEWLINE: ('\r' ? '\n')+ ;
WS : [\t]+ -> skip ; // skip tabs only (spaces are part of STRING)
COMMENT: '#' .*? '\r'?'\n' -> skip;
EMPTYLINE: '\r\n' -> skip;

Part of the INI file

#======================================================================
# MPI run details
#======================================================================

[MPI Details: Open MPI]

# MPI tests
#exec = mpirun @hosts@ -np &test_np() @mca@ --prefix &test_prefix() &test_executable() &test_argv()
exec = mpirun @hosts@ -np &test_np() --prefix &test_prefix() &test_executable() &test_argv()

hosts = &if(&have_hostfile(), "--hostfile " . &hostfile(), \
            &if(&have_hostlist(), "--host " . &hostlist(), ""))

One more small thing: it seems ";" cannot be matched as itself. ANTLR4 keeps saying it expects something else and treats the semicolon as an unknown symbol.

I guess a newline marks the end of a construct in your grammar. Why don't you allow empty constructs at the parser level instead? Or, if you have a newline token, you could just treat a newline as one or more newlines, for instance NL : [\r\n]+ ; which is easier. - Lucas Trzesniewski
@LucasTrzesniewski Thank you for the comment. I am actually trying to parse an INI file that uses "\r\n" (on Windows) as the line separator, much like the semicolon in Java. The newline token you mentioned, NL : [\r\n]+ ; was the first option I tried. It showed all the "\r\n" symbols in the tree nodes, which was fine. However, the requirement has now changed to skipping only the empty lines. I am wondering whether that is possible. If it is not, I will ask them to change the requirement. - alvinchen
Hmm... I'm not sure I understand how this doesn't satisfy your requirement. Post your grammar, it'll make your question clearer. - Lucas Trzesniewski
@LucasTrzesniewski Thank you. I posted the grammar. It is not finished yet, so please ignore some stupid mistakes. :) - alvinchen

2 Answers

1
votes

The short answer to your question is that whitespace is not significant to your parser, so skip it all in the lexer.

The longer answer is to recognize that skipping whitespace (or any other character sequence) does not mean that it is not significant in the lexer. All it means is that no corresponding token is produced for consumption by the parser. Skipped whitespace will therefore still operate as a delimiter for generated tokens.
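
A minimal sketch of that point, assuming a hypothetical lexer grammar (the names SkipDemo and WORD are illustrative and not taken from the question): WS is matched and discarded, yet it still ends the WORD that precedes it.

```antlr
lexer grammar SkipDemo;    // hypothetical name, for illustration only

// A run of letters or digits becomes one token.
WORD : [a-zA-Z0-9]+ ;

// Spaces, tabs, and line breaks are matched, then discarded:
// the parser never sees them, but they still separate WORD tokens.
WS   : [ \t\r\n]+ -> skip ;
```

Under this sketch, the input abcdef\r\n\r\nabcdef\r\n from the question should reach the parser as two WORD tokens followed by EOF, with no trace of the empty line.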

A couple of additional observations:

  1. ANTLR does not do regexes. Thinking along those lines will lead to further conceptual difficulties.

  2. Don't ignore the warning and error messages produced when generating the lexer/parser. They almost always require correction before the generated code will function correctly.

  3. It really helps to verify that the lexer is producing your intended token stream before trying to debug parser rules. See this answer that shows how to dump the token stream.

0
votes

I ran into the same issue while building a language that does not require a ; command delimiter. What resolved it for me was adding the newline as a valid parser rule that does nothing. I am no expert on this matter, but it worked:

nl : NEWLINE{};

The NEWLINE token looks like this (no skipping):

NEWLINE: '\r'? '\n';
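
Put together, the idea might look like the sketch below. This is only an illustration under assumptions: the grammar name NlDemo, the rules file and line, and the TEXT token are hypothetical, and the lexer is assumed to emit one NEWLINE per line break, written as an optional '\r' followed by '\n'.

```antlr
grammar NlDemo;            // hypothetical name, for illustration only

// A file is any mix of content lines and bare line breaks.
file : (line | nl)* EOF ;
line : TEXT NEWLINE ;      // a real line ends with its own NEWLINE
nl   : NEWLINE ;           // an empty line: matched, then ignored by your code

// One token per line break; nothing is skipped in the lexer.
NEWLINE : '\r'? '\n' ;
TEXT    : ~[\r\n]+ ;       // assumption: anything else on a line is content
```

With this sketch, abcdef\r\n\r\nabcdef\r\n should parse as line, nl, line, so the empty line appears in the tree as its own nl node that a listener or visitor can simply skip.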