2
votes

Preface

I'm trying to write my own context-free-grammar specification, to associate with the rules of my lexer/parser. It is meant to be similar to that of ANTLR's, where upper-case identifiers classify as a Lexer rule and lower-case identifiers classify as a Parser rule. It is meant to accept any combination of string literals and/or regular expressions for lexer rules, and any combination of lexer/regex rules and/or other parser identifiers for parser rules. Each rule in is the format of <identifier>:<expression>;

Here's an example of the grammar:

integer      : DIGIT+;        //parser rule containing at least one lexer rule
twodigits    : DIGIT DIGIT;   //parser rule containing two consecutive lexer rules
DIGIT        : [0-9];         //lexer rule containing regex
string       : '"' CHAR* '"'; //parser rule containing zero or more 
                              //  lexer rules, wrapped in two string literals
CHAR         : (LCHAR|UCHAR); //lexer rule containing two lexer rules which
                              //  will later evaluate to one of two tokens
LCHAR        : [a-z];         //lexer rule containing regex
UCHAR        : [A-Z];         //lexer rule containing regex
SPACE        : ' ';           //lexer rule containing string literal


Problem

The trouble I'm having is the ability to match the expression strings, since their contents tend to vary.
I have originally written:
([a-zA-Z0-9_]*)(?:\s*)(?:\:)(?:\s*)((?:\'?).*(?:\'?)(?:\;))
as the match rule, which does okay for a single string literal expression surrounded by single quotes, but I need to expand this to allow for multiple non-greedy string literals, and combined statements separated by any number of whitespace. I am not concerned with matching potential regex's within a matched expression, or even capturing segregated parts of the expression, as this is handled later on by a separate regex operation, so really I just need to validate identifiers and expressions...

All in all, I need the regex_search operation to look through the grammar's contents, using the following syntax for matches:

  • A valid identifier, starting with one or more lower or uppercase letters, optionally followed by any number of alphanumeric characters (which optionally can contain any number of underscore characters inbetween, as long as the identifier does not start or end with one).
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.
  • A colon without capturing it.
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.
  • At least one of: (in any order) any number of string literals (enclosed in single quotes, without capturing the quotes), any number of lexer/parser identifiers, any number of regex's (enclosed in square brackets). The result of this match rule should capture the entire expression as a single string, which will later go through a post-processing stage.
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.
  • A semicolon optionally followed by any uncaptured whitespace.
  • Optionally, any number of uncaptured spaces followed by a single captured line comment
  • Any number of whitespace characters, tabs, newlines etc, without capturing it.

Question

Is it possible to place this into a single regex_search operation?
I've messed around in Expresso and just can't seem to get it right...


Update

So far, I've been able to come up with the following:

#/////////////////////
# Identifier
#/////////////////////
(
    (?:[a-zA-Z]+)           # At least one lower/uppercase letter
    (?:
        (?:[a-zA-Z0-9_]*)   # Zero or more alphanumeric/underscore characters,
        (?:\w+)             # explicitly followed by one or more alphanumeric
    )?                      #   characters
)

#/////////////////////
# Separator
#/////////////////////
(?:\s*)                     # Any amount of uncaptured whitespace
(?:\:)                      # An uncaptured colon
(?:\s*)                     # Any amount of uncaptured whitespace

#///////////////////////
# Expression
#///////////////////////
(
    # String Literals:
    (?:\'?)                 # An optional single quote,
    (?:                     #   which is meant to start and end a string
        (?:[^'\\] | \\.)*   #   literal, but issues several problems for
    )                       #   me (see comments below, after this code block)
    (?:\'?)
    # Other expressions
    # ????????????
)

#/////////////////////
# Line End
#/////////////////////
(?:\s*)                     # Any amount of uncaptured whitespace
(?:\;)                      # An uncaptured colon
(?:\s*)                     # Any amount of uncaptured whitespace

As you can see, I have identifiers, separators and line-ends working perfectly. But expressions are where I'm totally stuck!

How can I tell the regex library that I want EITHER a non-greedy string literal, OR any set of characters before the Line End, AND any number of them in any order?

Even if I only allowed a single string literal, how would I say "The closing single quote is NOT optional if the first one exists"?

1
I haven't read your full question, but if you are parsing a free-grammar with C++11 regex, then it is not possible. A stack is necessary to parse context-free grammar (.NET or Perl flavor can do this, since they have construct that emulate a stack).nhahtdh
@nhahtdh How is it not possible to parse a context-free (or any) grammar file? For each line, I'm really only trying to grab the non-terminal (identifier) and terminal (expression) strings. In theory, this is easy until you try to match any combination of string literals and other terminals as a single terminal expression.RectangleEquals
Are you lexing or parsing? Lexing is always possible, but parsing CFG with C++11 regex is not possible. (My comment earlier wrote "free-grammar", but should be read as "context-free grammar"). If your grammar is actually a regular grammar, instead of a "real" context-free grammar, then it is possible to parse with C++11 regex.nhahtdh
Both, actually. The plan is to parse the grammar to build a list of identifiers and expressions, then post-process those expressions, building string literals into tokens and replacing terminal symbols in expressions with their equivalent non-terminal and/or regex values... and this is all before parsing the target input source, in a library I've pridefully named "Parlex"RectangleEquals
Please edit it to focus on the part that you are stuck. There are too many things in your post that it is distracting.nhahtdh

1 Answers

0
votes

It might not be flawless, and may require additional coding in how match results are handled, but this appears to work:

#/////////////////////
# Identifier
#/////////////////////
(
    (?:[a-zA-Z]+)
    (?:
        (?:[a-zA-Z0-9_]*)
        (?:\w+)
    )?
)

#/////////////////////
# Separator
#/////////////////////
(?:\s*\:\s*)

#///////////////////////
# Expression
#///////////////////////
(
    '(?:\\\\.|[^'])*'|[^']+     # Might need to be processed separately
)

#/////////////////////
# Line End
#/////////////////////
(?:\s*\;\s*)