I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not when they are part of a word.
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input or preceded by a space, and followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
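For illustration only, the intended rule can be written with lookarounds in Python's re module. This is not ANTLR syntax; it is a minimal sketch of the behaviour being asked for. The names KEYWORD and find_keywords are hypothetical, and the pattern assumes that "part of a word" means "touching any character other than whitespace or a parenthesis".

```python
import re

# Assumed rule: a keyword counts only when the character on each side is
# whitespace, a parenthesis, or the edge of the input. The lookarounds are
# zero-width, so neighbouring parentheses are not consumed.
KEYWORD = re.compile(r"(?<![^\s()])(?:AND|OR|NOT)(?![^\s()])")

def find_keywords(text):
    """Return the keywords recognized in text, in order of appearance."""
    return KEYWORD.findall(text)
```

With this pattern, find_keywords("(x)AND(y)") returns ["AND"] and find_keywords("xANDy") returns an empty list, while the parentheses remain available to be matched as separate tokens.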
EDIT:
Per the comments, here's some context. This is related to another question: "Antlr: how to match everything between the other recognized tokens?" My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered and run a totally different tokenizer on them. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't describe in advance what an ID is. Each human language is different. I want to work in stages: first run a single query-language tokenizer, then apply a human-language tokenizer to what's left.
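The staged approach described above might look roughly like the following Python sketch. It is not the actual ANTLR-based solution. The names QUERY_TOKEN, tokenize, and word_tokenizer are hypothetical, and str.split merely stands in for whatever human-language tokenizer is plugged in.

```python
import re

# Stage 1 (assumed pattern): operators bounded by whitespace, parentheses,
# or the edges of the input, plus the parentheses themselves.
QUERY_TOKEN = re.compile(r"(?<![^\s()])(?:AND|OR|NOT)(?![^\s()])|[()]")

def tokenize(text, word_tokenizer=str.split):
    """Emit query-language tokens; pass the gaps to word_tokenizer."""
    tokens = []
    pos = 0
    for m in QUERY_TOKEN.finditer(text):
        # Stage 2: text not covered by stage 1 goes to the language tokenizer.
        tokens.extend(("TEXT", w) for w in word_tokenizer(text[pos:m.start()]))
        kind = "PAREN" if m.group() in "()" else "OP"
        tokens.append((kind, m.group()))
        pos = m.end()
    # Whatever follows the last query-language token is also plain text.
    tokens.extend(("TEXT", w) for w in word_tokenizer(text[pos:]))
    return tokens
```

Because the second stage is just a callable, a different tokenizer can be supplied per human language without touching the query-language stage.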
How should `xANDy` and `abcNOTdef` then be tokenised? These are usually tokenised as some sort of identifier token, in which case you shouldn't have an issue. Some more context about what you're trying to parse/tokenise would really help. – Bart Kiers
Define one rule for `AND` and one for `ID`, where `ID` matches your identifiers. Place the keyword rule (`AND`) before the `ID` rule in your grammar. It will match when `and` comes in alone (e.g. surrounded by whitespace or non-ID characters). Otherwise `ID` matches and gives you any identifier (even those containing the letters `and`). – Mike Lischke
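The rule-ordering advice relies on how ANTLR lexers choose between rules: the longest match wins, and a tie goes to the rule defined first. The Python sketch below imitates that selection logic so the effect can be seen without a grammar. It is not ANTLR code, and the RULES table, the lex function, and especially the ID pattern are assumptions made for illustration.

```python
import re

# Rules in priority order: keywords come before ID, as suggested above.
# The ID pattern is a placeholder; a real grammar would define its own.
RULES = [
    ("AND", re.compile(r"AND")),
    ("OR", re.compile(r"OR")),
    ("NOT", re.compile(r"NOT")),
    ("LPAREN", re.compile(r"\(")),
    ("RPAREN", re.compile(r"\)")),
    ("ID", re.compile(r"[A-Za-z0-9_]+")),
    ("WS", re.compile(r"\s+")),
]

def lex(text):
    """Tokenize text using longest match, breaking ties by rule order."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if best_name != "WS":  # whitespace is skipped, not emitted
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

Here lex("xANDy") yields a single ID token because ID matches more characters than AND could, while lex("NOT(x)") yields NOT, LPAREN, ID, RPAREN because NOT and ID tie at three characters and NOT is listed first.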