capture in a string everything that is not a token

Question

Context: I am dealing with a mix of boolean and arithmetic expressions that may look like the following example:

b_1 /\ (0 <= x_1) /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4)))

I want to match and extract any constraint of the shape A <= B contained in the formula that must always be true. In the above example, only 0 <= x_1 satisfies this criterion.

Current Goal: My idea is to build a simple parse tree of the input formula focusing only on the following tokens: and (/\), or (\/), left parenthesis "(" and right parenthesis ")". Given the above formula, I would like to generate the following AST:

/\
|_ "b_1"
|_ /\
   |_ "0 <= x_1"
   |_ \/
      |_ "x_2 <= 2"
      |_ /\
         |_ "b_3"
         |_ "(/ 1 3) <= x_4"

Then, I can simply walk through the AST and discard any sub-tree rooted at \/.

My Attempt:

Looking at this documentation, I am defining the grammar for the lexer as follows:

import ply.lex as lex

tokens = (
    "LPAREN",
    "RPAREN",
    "AND",
    "OR",
    "STRING",
)

t_AND    = r'\/\\'
t_OR     = r'\\\/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

t_ignore = ' \t\n'

def t_error(t):
    print(t)
    print("Illegal character '{}'".format(t.value[0]))
    t.lexer.skip(1)

def t_STRING(t):
    r'^(?!\)|\(| |\t|\n|\\\/|\/\\)'
    t.value = t
    return t

data = "b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))"

lexer = lex.lex()

lexer.input(data)

while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok.type, tok.value, tok.lineno, tok.lexpos)

However, I get the following output:

~$ python3 lex.py
LexToken(error,'b_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,0)
Illegal character 'b'
LexToken(error,'_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,1)
Illegal character '_'
LexToken(error,'1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,2)
Illegal character '1'
AND /\ 1 4
LPAREN ( 1 7
LexToken(error,'x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,8)
Illegal character 'x'
LexToken(error,'_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,9)
Illegal character '_'
LexToken(error,'2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,10)
Illegal character '2'
LexToken(error,'<= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,12)
Illegal character '<'
LexToken(error,'= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,13)
Illegal character '='
LexToken(error,'2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,15)
Illegal character '2'
OR \/ 1 17
LPAREN ( 1 20
LexToken(error,'b_3 /\\ ((/ 1 3) <= x_4))',1,21)
Illegal character 'b'
LexToken(error,'_3 /\\ ((/ 1 3) <= x_4))',1,22)
Illegal character '_'
LexToken(error,'3 /\\ ((/ 1 3) <= x_4))',1,23)
Illegal character '3'
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
LexToken(error,'/ 1 3) <= x_4))',1,30)
Illegal character '/'
LexToken(error,'1 3) <= x_4))',1,32)
Illegal character '1'
LexToken(error,'3) <= x_4))',1,34)
Illegal character '3'
RPAREN ) 1 35
LexToken(error,'<= x_4))',1,37)
Illegal character '<'
LexToken(error,'= x_4))',1,38)
Illegal character '='
LexToken(error,'x_4))',1,40)
Illegal character 'x'
LexToken(error,'_4))',1,41)
Illegal character '_'
LexToken(error,'4))',1,42)
Illegal character '4'
RPAREN ) 1 43
RPAREN ) 1 44

The t_STRING token is not recognized as it should be.

Question: how should I write the catch-all regular expression for t_STRING so as to get a working tokenizer?

Are we to assume x_1, x_2 and so on represent non-negative integers or floats? Is it reasonable to expect the reader to know what (/ 1 3)? means? At least one reader has no idea what it means. — Cary Swoveland
@CarySwoveland the idea is that besides /\, \/, (), all that remains are expressions that can be parsed as arbitrary strings. My understanding is that the type of x_1, x_2, etc. and the meaning of (/ 1 3) is not relevant to the question, because there is no need to parse them as anything else than elements of a generic string. — user13201049
@Cori: As I say at the end of my answer, there is no obvious way for a lexical analyser to know that some ( are supposed to be tokens, while the parentheses in (/ 1 3) are supposed to be elements of a generic string. You are really better off reducing the input to a series of tokens. But that's not the focus of the answer, which tries to explain the problem with your regular expression. — rici
@rici thank you for your answer, I'll start reading now. Regarding your comment, you are right in pointing that out. That was an oversight on my part. I gave an example of the expected AST, but not of the expected tokens, which would have avoided this ambiguity. I am fine with / 1 3 being parsed as a token on its own. I can merge it with the parent node inside yacc, I think. — user13201049

rici · Accepted Answer · 2020-04-02T21:16:29

Your regular expression for t_STRING most certainly doesn't do what you want. What it does do is a little more difficult to answer.

In principle, it consists only of two zero-length assertions: ^, which is only true at the beginning of the string (unless you provide the re.MULTILINE flag, which you don't), and a long negative lookahead assertion.

A pattern which consists only of zero-length assertions can only match the empty string, if it matches anything at all. But lexer patterns cannot be allowed to match the empty string. Lexers divide the input into a series of tokens, so that every character in the input belongs to some token. Each match -- and they are all matches, not searches -- starts precisely at the end of the previous match. So if a pattern could match the empty string, the lexer would try the next match at the same place, with the same result, which would be an endless loop.
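
To make that concrete, here is a minimal check using only the re module, outside Ply and without re.VERBOSE. The pattern is a cut-down stand-in I chose for this illustration (an anchor plus one negative lookahead), not the pattern from the question:

```python
import re

# Stand-in pattern for this example: an anchor plus a negative lookahead,
# i.e. nothing but zero-length assertions.
only_assertions = re.compile(r'^(?!\()')

m = only_assertions.match('b_1 /\\ x')
print(repr(m.group()))  # the text that was matched
print(m.end())          # where the next match would have to start
```

The match succeeds, but the matched text is the empty string and the end position is still 0, so a lexer that accepted it would never advance.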

Some lexer generators solve this problem by forcing a minimum one-character match using a built-in catch-all error pattern, but Ply simply refuses to generate a lexer if a pattern matches the empty string. Yet Ply does not complain about this lexer specification. The only possible explanation is that the pattern cannot match anything.

The key is that Ply compiles all patterns using the re.VERBOSE flag, which allows you to separate items in regular expressions with whitespace, making the regexes slightly less unreadable. As the Python documentation indicates:

Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>.

Whitespace includes newlines and even comments (starting with a # character), so you can split patterns over several lines and insert comments about each piece.

We could do that, in fact, with your pattern:

def t_STRING(t):
    r'''^         # Anchor this match at the beginning of the input
        (?!       # Don't match if the next characters match:
           \)   | # Close parenthesis
           \(   | # Open parenthesis
                | # !!! HERE IS THE PROBLEM: the bare space is dropped, leaving an empty alternative
           \t   | # Tab character
           \n   | # Newline character
           \\\/ | # \/ token
           \/\\   # /\ token
        )
     '''
    t.value = t
    return t

While adding whitespace and comments to your pattern, I noticed that the original attempted to match a space character as an alternative, with | |. But since the pattern is compiled with re.VERBOSE, that space character is ignored, leaving an empty alternative, which matches the empty string. That alternative is part of a negative lookahead assertion, so the assertion fails whenever the text at that point starts with the empty string. Of course, every string starts with the empty string, so the negative lookahead assertion always fails. That explains why Ply didn't complain (and why the pattern never matches anything).
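
The effect of the flag can be checked with the re module alone. This sketch compiles the pattern from the question twice, once as written and once with re.VERBOSE; the variable names are mine:

```python
import re

# The pattern from the question, unchanged.
question_pattern = r'^(?!\)|\(| |\t|\n|\\\/|\/\\)'

as_written = re.compile(question_pattern)                  # the space is a real alternative
with_verbose = re.compile(question_pattern, re.VERBOSE)    # the space is dropped

print(as_written.match('b_1') is not None)    # lookahead passes, empty match
print(with_verbose.match('b_1') is not None)  # lookahead always rejects
```

Without the flag the pattern produces an (empty) match at b_1; with the flag it produces no match at all, whatever the input.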

Regardless of that particular glitch, the pattern cannot be useful because, as mentioned already, a lexer pattern must match some characters, so a pattern which only matches the empty string cannot be useful. What we want to do is match any character, provided that the negative lookahead (corrected, as below) allows it. So the negative lookahead assertion should be followed by ., which will match the next character.

But you almost certainly don't want to match just one character. Presumably you wanted to match a string of characters which don't match any other token. So that means putting the negative lookahead assertion and the following . into a repetition. And remember that it needs to be a non-empty repetition (+, not *), because patterns must not have empty matches.

Finally, there is absolutely no point using an anchor assertion, because that would limit the pattern to matching only at the beginning of the input, and that is certainly not what you want. It's not at all clear what it is doing there. (I've seen recommendations which suggest using an anchor with a negative lookahead search, which I think are generally misguided, but that discussion is out of scope for this question.)

And before we write the pattern, let's make one more adjustment: in a Python regular expression, if you can replace a set of alternatives with a character class, you should do so because it is a lot more efficient. That's true even if only some of the alternatives can be replaced.

So that produces the following:

def t_STRING(t):
    r'''(
         (?!            # Don't match if the next characters match:
            [() \t\n] |   # Parentheses or whitespace
            \\\/      |   # \/ token
            \/\\          # /\ token
         ) .            # If none of the above match, accept a character
        )+              # and repeat as many times as possible (at least once)
     '''
    return t

I removed t.value = t. t is a token object, not a string, and the value should be the string it matched. If you overwrite the value with a circular reference, you won't be able to figure out which string was matched.

This works, but not quite in the way you intended. Since whitespace characters are excluded from t_STRING, you don't get a single token representing (/ 1 3) <= x_4. Instead, you get a series of tokens:

STRING b_1 1 0
AND /\ 1 4
LPAREN ( 1 7
STRING x_2 1 8
STRING <= 1 12
STRING 2 1 15
OR \/ 1 17
LPAREN ( 1 20
STRING b_3 1 21
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
STRING / 1 30
STRING 1 1 32
STRING 3 1 34
RPAREN ) 1 35
STRING <= 1 37
STRING x_4 1 40
RPAREN ) 1 43
RPAREN ) 1 44

But I think that's reasonable. How could the lexer tell that the parentheses in (x_2 <= 2 and (b_3 are parenthesis tokens, while the parentheses in (/ 1 3) <= x_4 are part of t_STRING? That determination will need to be made in your parser.
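
As a rough sketch of how a parser could make that call, here is a small hand-written recursive-descent parser. It is not a Ply/yacc grammar, and the function names, the tuple-based tree and the rule "adjacent pieces form one arithmetic term" are all my assumptions. The tokens are re-created with plain re so the example runs without Ply. It builds left-nested nodes, so the tree shape differs slightly from the one in the question, but the leaves are the same:

```python
import re

# The five token kinds from the lexer, re-created with plain `re`.
# Whitespace matches no alternative, so finditer simply skips it.
TOKEN = re.compile(r'''
    (?P<AND> /\\ ) | (?P<OR> \\/ ) | (?P<LPAREN> \( ) | (?P<RPAREN> \) )
  | (?P<STRING> (?: (?! [()\s] | \\/ | /\\ ) . )+ )
''', re.VERBOSE)

def tokenize(text):
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

def parse(tokens):
    """Build nested (operator, left, right) tuples; leaves are plain strings."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def binary(parse_operand, kind, symbol):
        nonlocal pos
        node = parse_operand()
        while peek() == kind:
            pos += 1
            node = (symbol, node, parse_operand())
        return node

    def disjunction():
        return binary(conjunction, 'OR', '\\/')

    def conjunction():
        return binary(atom, 'AND', '/\\')

    def atom():
        nonlocal pos
        pieces = []  # (was_parenthesised, node) pairs
        while peek() in ('STRING', 'LPAREN'):
            kind, text = tokens[pos]
            pos += 1
            if kind == 'STRING':
                pieces.append((False, text))
                continue
            inner = disjunction()
            if peek() != 'RPAREN':
                raise SyntaxError('expected )')
            pos += 1
            pieces.append((True, inner))
        if not pieces:
            raise SyntaxError('expected an operand')
        if len(pieces) == 1:
            return pieces[0][1]  # a lone leaf, or a parenthesised sub-formula
        if not all(isinstance(node, str) for _, node in pieces):
            raise SyntaxError('boolean operator inside an arithmetic term')
        # Several pieces side by side: the parentheses belonged to a term.
        return ' '.join(f'({node})' if paren else node for paren, node in pieces)

    tree = disjunction()
    if pos != len(tokens):
        raise SyntaxError('unexpected token')
    return tree

def always_true(node):
    """Collect the leaves that are not below any \\/ node."""
    if isinstance(node, str):
        return [node]
    operator, left, right = node
    if operator == '\\/':
        return []
    return always_true(left) + always_true(right)

formula = r"b_1 /\ (0 <= x_1) /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4)))"
tree = parse(tokenize(formula))
print([leaf for leaf in always_true(tree) if '<=' in leaf])  # ['0 <= x_1']
```

The parenthesised group (/ 1 3) is folded back into its neighbouring pieces because it is followed directly by more STRING tokens, while (0 <= x_1) stands alone and is treated as a sub-formula.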

In fact, my inclination would be to fully tokenise the input, even if you don't (yet) require a complete tokenisation. As this entire question and answer shows, attempting to recognise "everything but..." can actually be a lot more complicated than just recognising all tokens. Trying to get the tokeniser to figure out which tokens are useful and which ones aren't is often more difficult than tokenising everything and passing it through a parser.
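
For comparison, a full tokenisation might look something like the following. The token names (LE, NUMBER, NAME, OP) and their patterns are guesses on my part about what the arithmetic sub-language contains, and the sketch again uses plain re rather than Ply:

```python
import re

# Hypothetical complete token set; every kind gets its own simple pattern.
TOKEN_RE = re.compile(r'''
    (?P<AND>    /\\ )
  | (?P<OR>     \\/ )
  | (?P<LE>     <= )
  | (?P<LPAREN> \( )
  | (?P<RPAREN> \) )
  | (?P<NUMBER> \d+ (?: \. \d+ )? )
  | (?P<NAME>   [A-Za-z_]\w* )
  | (?P<OP>     [-+*/] )
  | (?P<WS>     \s+ )
''', re.VERBOSE)

def tokenize(text):
    """Yield (kind, text) pairs, dropping whitespace."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f'unexpected character {text[pos]!r} at {pos}')
        if m.lastgroup != 'WS':
            yield m.lastgroup, m.group()
        pos = m.end()

kinds = [kind for kind, _ in tokenize(r"b_3 /\ ((/ 1 3) <= x_4)")]
print(kinds)
```

No lookahead is needed here: each pattern says what it is, rather than what it isn't, and AND is listed before OP so that /\ wins over a lone /.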
