0
votes

I am using the ANTLR C++ grammar below to parse my code:

https://github.com/antlr/grammars-v4/tree/master/cpp

But I get parsing errors for the following code:

TEST_F(TestClass, false_positive__N)
{
  static constexpr char text[] =
    R"~~~(; ModuleID = 'a.cpp'
            source_filename = "a.cpp"

   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {
     ret i32 %arg1
   }

define i32 @main(i32 %arg1) {
   %1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)
   ret i32 %1
}
)~~~";

 NameMock ns(text);
 ASSERT_EQ(std::string(text), ns.getSeed());
}

Error Details:

line 12:29 token recognition error at: '#1'
line 12:37 token recognition error at: '"(i32 %arg1)\n'
line 12:31 missing ';' at '00007_'
line 13:2 missing ';' at 'ret'
line 13:10 mismatched input '%' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 14:0 missing ';' at '}'
line 15:0 mismatched input ')' expecting {'alignas', '(', '[', '{', '=', ',', ';'}
line 15:4 token recognition error at: '";\n'

What modification is needed in the parser or lexer to parse this input correctly? Any help is highly appreciated. Thanks in advance.

2
@rici It looks like valid C++: all the stuff the grammar is complaining about is inside a raw string literal, which was added in C++11. The grammar looks like it supports raw string literals, so I don't know why it is complaining. - Jerry Jeremiah
@JerryJeremiah: maybe I misinterpreted the snippet: I thought the raw string was being passed to the parser. If the whole snippet is being parsed, then the inadequacy of the lexer's rawstring definition will be the problem. - rici
@JerryJeremiah: I guess you are right; the text is being passed to the grammar. The error is entirely consistent with that. - rici
I think you are right. The grammar source code says fragment Rawstring: '"' .*? '(' .*? ')' .*? '"'; If the regex is greedy then it works. But consider what that matches if the regex is non-greedy. Have a look here: regex101.com/r/sdLnfw/1 The part it doesn't think is inside the Rawstring is exactly the same as the stuff that the error talks about. The Rawstring definition should ensure that the text between the quote and the paren at the front matches the text between the paren and the quote at the end. You should raise an issue with the repo. I'll see if I can figure out how to modify the lexer definitions. - Jerry Jeremiah
Here is a question asking almost exactly the same thing. In their case they want to match a single arbitrary character that appears at the front and back of their text, whereas you need an arbitrary string of characters. But the principle should be the same: stackoverflow.com/questions/38218744/… - Jerry Jeremiah

2 Answers

2
votes

Whenever a certain input does not get parsed properly, I start by displaying all the tokens the input is generating. If you do that, you'll probably see why things are going wrong. Another way would be to remove most of the source, and gradually add more lines of code to it: at a certain point the parser will fail, and you have a starting point to solving it.

So if you dump the tokens your input is creating, you'd get these tokens:

Identifier                `TEST_F`
LeftParen                 `(`
Identifier                `TestClass`
Comma                     `,`
Identifier                `false_positive__N`
RightParen                `)`
LeftBrace                 `{`
Static                    `static`
Constexpr                 `constexpr`
Char                      `char`
Identifier                `text`
LeftBracket               `[`
RightBracket              `]`
Assign                    `=`
UserDefinedLiteral        `R"~~~(; ModuleID = 'a.cpp'\n            source_filename = "a.cpp"\n\n   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n     ret i32 %arg1\n   }\n\ndefine i32 @main(i32 %arg1) {\n   %1 = call i32 @"__ir_hidden`
Directive                 `#100007_"(i32 %arg1)`
...

you can see that the input R"~~~( ... )~~~" is not tokenised as a StringLiteral. Note that a StringLiteral will never be created because at the top of the lexer grammar there is this rule:

Literal:
    IntegerLiteral
    | CharacterLiteral
    | FloatingLiteral
    | StringLiteral
    | BooleanLiteral
    | PointerLiteral
    | UserDefinedLiteral;

causing none of the IntegerLiteral..UserDefinedLiteral to be created: all of them will become Literal tokens. It is far better to move this Literal rule to the parser instead. I must admit that while scrolling through the lexer grammar, it is a bit of a mess, and fixing the R"~~~( ... )~~~" will only delay another lingering problem popping its ugly head :). I am pretty sure this grammar has never been properly tested, and is full of bugs.

If you look at the lexer definition of a StringLiteral:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R' Rawstring
 ;

fragment Rawstring
 : '"' .*? '(' .*? ')' .*? '"'
 ;

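it is clear why '"' .*? '(' .*? ')' .*? '"' will not match your entire string literal: every .*? is non-greedy, so the rule stops at the first ')' after the opening '(' and then at the first '"' after that, which is well before the closing )~~~".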
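To see the effect without running ANTLR, here is a minimal Java sketch that uses java.util.regex as a stand-in for the lexer rule. It is only an approximation: the class name, method name and the shortened input are made up for illustration, and a regex engine is not the ANTLR lexer, although the non-greedy behaviour is the same for this input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonGreedyRawStringDemo {
    // Regex analogue of: 'R' '"' .*? '(' .*? ')' .*? '"'
    // DOTALL lets '.' match newlines, as '.' does in an ANTLR lexer rule.
    static final Pattern RAW =
        Pattern.compile("R\".*?\\(.*?\\).*?\"", Pattern.DOTALL);

    /** Returns the first substring the non-greedy pattern matches, or null. */
    static String firstMatch(String source) {
        Matcher m = RAW.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the raw string in the question (hypothetical input).
        String source = "R\"~~~(call @\"f#1\"(i32 %x) ret \"s\")~~~\";";
        System.out.println(firstMatch(source));
    }
}
```

The match ends at the quote that follows the first ')', so everything after it, including the real closing )~~~", is left over for the lexer to tokenise as ordinary C++.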

What you need is a rule looking like this:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( . )* ')' ~["]* '"'
 ;

but that will cause the ( . )* to consume too much: it will grab every character and will then backtrack to the last quote in your character stream (not what you want).

What you really want is this:

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( /* break out of this loop when we see `)~~~"` */ . )* ')' ~["]* '"'
 ;

The "break out of this loop when we see )~~~"" part can be done with a semantic predicate like this:

lexer grammar CPP14Lexer;

@members {
  private boolean closeDelimiterAhead(String matched) {
    // Grab everything between the matched text's first quote and first '('. Prepend a ')' and append a quote
    String delimiter = ")" + matched.substring(matched.indexOf('"') + 1, matched.indexOf('(')) + "\"";
    StringBuilder ahead = new StringBuilder();

    // Collect as many characters ahead as there are `delimiter`-chars
    for (int n = 1; n <= delimiter.length(); n++) {
      if (_input.LA(n) == CPP14Lexer.EOF) {
        throw new RuntimeException("Missing delimiter: " + delimiter);
      }
      ahead.append((char) _input.LA(n));
    }

    return delimiter.equals(ahead.toString());
  }
}

...

StringLiteral
 : Encodingprefix? '"' Schar* '"'
 | Encodingprefix? 'R"' ~[(]* '(' ( {!closeDelimiterAhead(getText())}? . )* ')' ~["]* '"'
 ;

...

If you now dump the tokens, you will see this:

Identifier                `TEST_F`
LeftParen                 `(`
Identifier                `TestClass`
Comma                     `,`
Identifier                `false_positive__N`
RightParen                `)`
LeftBrace                 `{`
Static                    `static`
Constexpr                 `constexpr`
Char                      `char`
Identifier                `text`
LeftBracket               `[`
RightBracket              `]`
Assign                    `=`
Literal                   `R"~~~(; ModuleID = 'a.cpp'\n            source_filename = "a.cpp"\n\n   define private i32 @"__ir_hidden#100007_"(i32 %arg1) {\n     ret i32 %arg1\n   }\n\ndefine i32 @main(i32 %arg1) {\n   %1 = call i32 @"__ir_hidden#100007_"(i32 %arg1)\n   ret i32 %1\n}\n)~~~"`
Semi                      `;`
...

And there it is: R"~~~( ... )~~~" properly tokenised as a single token (albeit as a Literal token instead of a StringLiteral...). It will throw an exception when the input is like R"~~~( ... )~~" or R"~~~( ... )~~~~", because the closing delimiter never matches before the end of the input, and it will successfully tokenise input like R"~~~( )~~" )~~~~" )~~~".
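If you want to check these cases without regenerating the lexer, the same delimiter logic can be sketched in plain Java. This is an illustration only: RawStringScanner and endOfRawString are hypothetical names, and unlike the predicate above the sketch returns -1 instead of throwing when the closing delimiter is missing.

```java
final class RawStringScanner {
    /**
     * Returns the index just past the raw string literal that starts at
     * `start` (which must point at the R of R"), or -1 if the literal is
     * malformed or never closed.
     */
    static int endOfRawString(String src, int start) {
        if (!src.startsWith("R\"", start)) {
            return -1;
        }
        int open = src.indexOf('(', start + 2);
        if (open < 0) {
            return -1;
        }
        // Closing sequence: ')' + the text between the quote and '(' + '"'
        String closing = ")" + src.substring(start + 2, open) + "\"";

        // Consume one character at a time until the closing sequence is ahead,
        // mirroring ( {!closeDelimiterAhead(getText())}? . )*
        int pos = open + 1;
        while (pos < src.length() && !src.startsWith(closing, pos)) {
            pos++;
        }
        return pos < src.length() ? pos + closing.length() : -1;
    }

    public static void main(String[] args) {
        String ok = "R\"~~~( )~~\" )~~~~\" )~~~\"";
        System.out.println(endOfRawString(ok, 0) == ok.length());
        System.out.println(endOfRawString("R\"~~~( ... )~~\"", 0));
    }
}
```

The near-miss sequences )~~" and )~~~~" inside the literal are skipped because they are not exactly the closing sequence )~~~".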

Quickly looking into the parser grammar, I see tokens like StringLiteral being referenced, but such a token will never be produced by the lexer (as I mentioned earlier).

Proceed with caution with this grammar. I would not advise using it (blindly) for anything other than some sort of educational purpose. Do not use in production!

0
votes

The lexer changes below helped me resolve the raw string parsing issue:

 Stringliteral
   : Encodingprefix? '"' Schar* '"'
   | Encodingprefix? '"' Schar* '" GST_TIME_FORMAT'
   | Encodingprefix? 'R' Rawstring
 ;

fragment Rawstring
 : '"'              // Match Opening Double Quote
   ( /* Handle Empty D_CHAR_SEQ without Predicates
        This should also work
        '(' .*? ')'
      */
     '(' ( ~')' | ')'+ ~'"' )* (')'+)

   | D_CHAR_SEQ
         /*  // Limit D_CHAR_SEQ to 16 characters
            { ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
         */
     '('
     /* From Spec :
        Any member of the source character set, except
        a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
        ( which may be empty ) followed by a double quote ".

      - The following loop consumes characters until it matches the
        terminating sequence of characters for the RAW STRING
      - The options are mutually exclusive, so Only one will
        ever execute in each loop pass
      - Each Option will execute at least once.  The first option needs to
        match the ')' character even if the D_CHAR_SEQ is empty. The second
        option needs to match the closing \" to fall out of the loop. Each
        option will only consume at most 1 character
      */
     (   //  Consume everything but the Double Quote
       ~'"'
     |   //  If text Does Not End with closing Delimiter, consume the Double Quote
       '"'
       {
            !getText().endsWith(
                 ")"
               + getText().substring( getText().indexOf( "\"" ) + 1
                                    , getText().indexOf( "(" )
                                    )
               + '\"'
             )
       }?
     )*
   )
   '"'              // Match Closing Double Quote

   /*
   // Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
   //  Send D_CHAR_SEQ <TAB> ... to Parser
   {
     setText( getText().substring( getText().indexOf("\"") + 1
                                 , getText().indexOf("(")
                                 )
            + "\t"
            + getText().substring( getText().indexOf("(") + 1
                                 , getText().lastIndexOf(")")
                                 )
            );
   }
    */
 ;

 fragment D_CHAR_SEQ     // Should be limited to 16 characters
    : D_CHAR+
 ;
 fragment D_CHAR
      /*  Any member of the basic source character set except
          space, the left parenthesis (, the right parenthesis ),
          the backslash \, and the control characters representing
           horizontal tab, vertical tab, form feed, and newline.
      */
    : '\u0021'..'\u0023'
    | '\u0025'..'\u0027'
    | '\u002a'..'\u003f'
    | '\u0041'..'\u005b'
    | '\u005d'..'\u005f'
    | '\u0061'..'\u007e'
 ;
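For readers who want to try the endsWith idea outside the grammar, here is a rough plain-Java sketch of the check the predicate performs each time a double quote is consumed. The class and method names are hypothetical, and the sketch assumes the input starts at the R of the raw string; it does not validate D_CHAR or the 16-character limit.

```java
final class RawStringEndsWith {
    /**
     * Returns the raw string literal at the start of `src`, or null if
     * `src` does not start with R" or the literal is never closed.
     */
    static String match(String src) {
        int open = src.indexOf('(');
        if (!src.startsWith("R\"") || open < 0) {
            return null;
        }
        int quote = src.indexOf('"');   // opening quote, index 1 here
        String closing = ")" + src.substring(quote + 1, open) + "\"";

        for (int i = open + 1; i < src.length(); i++) {
            if (src.charAt(i) == '"') {
                // Same test as the predicate: does the text consumed so far
                // end with ')' + delimiter + '"'?
                String consumed = src.substring(0, i + 1);
                if (consumed.endsWith(closing)) {
                    return consumed;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(match("R\"~~~(a \"b\" )~~\" c)~~~\" + 1"));
    }
}
```

Only double quotes trigger the check, which is why the loop in the grammar can consume every other character without a predicate.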