Flex & Bison : Is it possible to parse Raw string literal using Backreference in Flex&Bison?

Question

I tried to parse a raw string literal using Flex&Bison. But unable to parse it.

Input:

R"foo(Hello World)foo"

Flex:

...
raw_string     R"([^\(]*)\(([^\)]*)\)\1" 

%%
{raw_string}   {return raw_string;}
%%

Bison:

%{
...
%}

%token raw_string
%start run

%%
run : raw_string;

%%

int main()
{
  yyin = stdin;
  do
  {
    yyparse();
  } while(!feof(yyin));
  return =0;
}

Error:

 invalid Syntax: at raw_string

Kindly help me to parse the raw string literal using Flex and Bison. If back referenceing is not possible is there any alternative way to parser raw string in flex and bison.

rici rici · Accepted Answer · 2018-02-15T19:24:11

Flex patterns have neither backreferences nor non-greedy matches, and you would require both of those in order to correctly identify a raw string with a "regular" expression. [Note 1]

However, flex does have a feature -- "start conditions" -- which makes it very simple to implement these features, at least in cases like this one where you never need to backtrack to try a different backreferenced subpattern.

A start condition is a way of using a different set of rules for a given lexical context; in this case, the context inside the raw string.

Note: The original code had a bug which would cause it to miss a raw string close sequence if the raw string were immediately followed by a few non-whitespace characters and another double quote. That's a pretty unlikely scenario, which perhaps explains why no-one noticed the bug. I fixed it by removing all the attempted cleverness. Apologies to anyone who copied the buggy code.

So a skeleton solution for (simplified) C++ raw strings would be:

%x C_RAW_STRING
dchar_seq [^()\\[:space:]]{0,16}
%%
   size_t delim_len;
R["]{dchar_seq}[(]    { delim_len = yyleng - 3;
                        yymore();
                        BEGIN(C_RAW_STRING);
                      }
R["]{dchar_seq}       { yyerror("Invalid raw string opener"); }

   /* Rules for other tokens omitted */

[[:space:]]+          ;                /* Ignore whitespace */
.                     return *yytext;  /* Fallback rule */
<C_RAW_STRING>{
   [^"]+              yymore();
   ["]                { if (yytext[yyleng - (delim_len + 2)] == ')' &&
                            memcmp(yytext + yyleng - (delim_len + 1),
                                   yytext + 2, delim_len) == 0) {
                           BEGIN(INITIAL);
                           return STRING_LITERAL;
                         }
                         else yymore();
                       }
   <<EOF>>             { yyerror("Unterminated raw string");
                         BEGIN(INITIAL);
                         return 0;
                       }
}

Some explanations

Line 1: Declare the start condition C_RAW_STRING as exclusive. (See flex manual, link above).
Line 2: dchar_seq matches what the C++ standard calls a "d-char-sequence" (see [lex.string], §5.13.5): Up to 16 characters which are anything except parentheses, backslashes or whitespace characters. (Note that a " is a d-char.) See the Flex manual chapter on patterns for details on the regular expression operators.
Line 4: Indented statements at the beginning of the rules section, before the first rule, are simply inserted at the top of the yylext function. Thus, they can be used to declare local variables used during a single invocation of yylex.
Lines 5-8: When we detect the beginning of a raw string literal, we:
- Remember how long the delimiter string is. (Since we will leave the delimiter string at the beginning of the token, we don't need to copy it anywhere. It's sufficient to know how long it is.)
- Use yymore() to tell the lexer to append the next match to the current token, instead of starting a new token.
- Use BEGIN to change to the C_RAW_STRING start condition.
Line 9: If we find R" and the pattern at line 5 didn't match, then the raw string delimiter wasn't correct, either because it was too long, or because it contained an invalid character or because it didn't end with an (. It would be OK to just match R", but I chose to match the longest possible d-char sequence as well to avoid backtracking the lexer, although in this particular case it doesn't make much difference. Because of the longest match rule, this rule will only trigger if the preceding rule doesn't. It's really hard to recover from errors like this, because there is no good way to know where the quoted string is expected to end. Here I didn't make any attempt at error recovery, since that's not the point of this answer.
Line 15: Starts a block of rules for the C_RAW_STRING start condition. This is a flex extension, not available in other lex implementations. For compatibility, all the patterns would have to be individually marked with <C_RAW_STRING>.
Line 16: Any sequence of characters other than a double quote are just appended to the token. Again, yymore() indicates that we haven't finished with the token yet.
Lines 17-24: The raw string must end with a quote, but a quote doesn't necessarily end the raw string. So we check every quote we encounter. We first make sure that there is an ) where we would expect it to be, given the length of the delimiter string, and then use memcmp to compare the two delimiter sequences. (We use memcmp rather than strcmp because the sequences we are comparing are not NUL-terminated; they are substrings of the token. strncmp would have been another possibility.)

If the delimiter string matches, we have found the end of the token and we can return it. In that case, we need to reset the start condition to INITIAL so that the next token will be scanned normally. If the string does not match, we use yymore() to tell yylex that we haven't finished with the token yet, and continue looking for the correct delimiter.
Lines 25-29: If end of input is detected inside the raw string, that means that the raw string was not correctly terminated. We issue an error message and return 0 to indicate that EOF was encountered. Resetting the start condition at this point is not really necessary but it seems cleaner to do so.

Notes

For example, you could use the following Gnu grep command (using the -P option to enable PCRE regex features), except that since it is a simple regex search rather than a tokeniser, it could produce false positives (for example, things that look like raw strings inside of comments):
```
 grep -Po '"([^()\\[:space:]]*)\(.*?\)\1"'
```

Flex & Bison : Is it possible to parse Raw string literal using Backreference in Flex&Bison?

1 Answers

Notes