Context: I am dealing with a mix of boolean and arithmetic expressions that may look like in the following example:
b_1 /\ (0 <= x_1) /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))))
I want to match and extract any constraint of the shape A <= B
contained in the formula which must be always true. In the above example, only 0 <= x_1
would satisfy such criterion.
Current Goal:
My idea is to build a simple parse tree of the input formula focusing only on the following tokens: and (/\
), or (\/
), left bracket ((
) and right bracket ()
). Given the above formula, I would like to generate the following AST:
/\
|_ "b_1"
|_ /\
|_ "0 <= x_1"
|_ \/
|_ "x_2 <= 2"
|_ /\
|_ "b_3"
|_ "(/ 1 3) <= x_4"
Then, I can simply walk through the AST and discard any sub-tree rooted at \/
.
My Attempt:
Looking at this documentation, I am defining the grammar for the lexer as follows:
import ply.lex as lex
tokens = (
"LPAREN",
"RPAREN",
"AND",
"OR",
"STRING",
)
t_AND = r'\/\\'
t_OR = r'\\\/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t\n'
def t_error(t):
print(t)
print("Illegal character '{}'".format(t.value[0]))
t.lexer.skip(1)
def t_STRING(t):
r'^(?!\)|\(| |\t|\n|\\\/|\/\\)'
t.value = t
return t
data = "b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))"
lexer = lex.lex()
lexer.input(data)
while True:
tok = lexer.token()
if not tok:
break
print(tok.type, tok.value, tok.lineno, tok.lexpos)
However, I get the following output:
~$ python3 lex.py
LexToken(error,'b_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,0)
Illegal character 'b'
LexToken(error,'_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,1)
Illegal character '_'
LexToken(error,'1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,2)
Illegal character '1'
AND /\ 1 4
LPAREN ( 1 7
LexToken(error,'x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,8)
Illegal character 'x'
LexToken(error,'_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,9)
Illegal character '_'
LexToken(error,'2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,10)
Illegal character '2'
LexToken(error,'<= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,12)
Illegal character '<'
LexToken(error,'= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,13)
Illegal character '='
LexToken(error,'2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,15)
Illegal character '2'
OR \/ 1 17
LPAREN ( 1 20
LexToken(error,'b_3 /\\ ((/ 1 3) <= x_4))',1,21)
Illegal character 'b'
LexToken(error,'_3 /\\ ((/ 1 3) <= x_4))',1,22)
Illegal character '_'
LexToken(error,'3 /\\ ((/ 1 3) <= x_4))',1,23)
Illegal character '3'
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
LexToken(error,'/ 1 3) <= x_4))',1,30)
Illegal character '/'
LexToken(error,'1 3) <= x_4))',1,32)
Illegal character '1'
LexToken(error,'3) <= x_4))',1,34)
Illegal character '3'
RPAREN ) 1 35
LexToken(error,'<= x_4))',1,37)
Illegal character '<'
LexToken(error,'= x_4))',1,38)
Illegal character '='
LexToken(error,'x_4))',1,40)
Illegal character 'x'
LexToken(error,'_4))',1,41)
Illegal character '_'
LexToken(error,'4))',1,42)
Illegal character '4'
RPAREN ) 1 43
RPAREN ) 1 44
The t_STRING
token is not correctly recognized as it should.
Question: how to set the catch all regular expression for t_STRING
so as to get a working tokenizer?
x_1
,x_2
and so on represent non-negative integers or floats? Is it reasonable to expect the reader to know what(/ 1 3)?
means? At least one reader has no idea what it means. – Cary Swoveland/\, \/, ()
, all that remains are expressions that can be parsed as arbitrary strings. My understanding is that the type ofx_1
,x_2
, etc. and the meaning of(/ 1 3)
is not relevant to the question, because there is no need to parse them as anything else than elements of a generic string. – user13201049(
are supposed to be tokens, while the parentheses in(/ 1 3)
are supposed to be elements of a generic string. You are really better off reducing the input to a series of tokens. But that's not the focus of the answer, which tries to explain the problem with your regular expression. – rici/ 1 3
being parsed as a token on its own. I can merge it with the parent node insideyacc
, I think. – user13201049