I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:

They should be recognized here:

x AND y
(x)AND(y)
NOT x
NOT(x)

but not here:

xANDy
abcNOTdef

AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input or preceded by a space, and is followed by a space or parenthesis.

The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.

Is there some kind of lookahead/lookbehind syntax I can use?
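
For illustration, here is a minimal sketch of the rule described above using lookbehind/lookahead in Python's `re` module. This is not ANTLR syntax. The function name `find_operators` is made up for this example, and it assumes that "part of a word" means "adjacent to a letter, digit, or underscore", which is slightly broader than the spaces-or-parentheses rule stated above. Because the lookarounds are zero-width, the parentheses are never consumed by the keyword match and come out as separate tokens.

```python
import re

# (?<!\w) and (?!\w) are zero-width checks: the keyword must not touch a
# word character on either side, but the neighbouring character itself is
# not consumed. The second alternative picks up parentheses on their own.
OPERATOR = re.compile(r"(?<!\w)(?:AND|OR|NOT)(?!\w)|[()]")

def find_operators(text):
    """Return (kind, start_offset) for each keyword or parenthesis found."""
    found = []
    for m in OPERATOR.finditer(text):
        kind = "PAREN" if m.group() in ("(", ")") else m.group()
        found.append((kind, m.start()))
    return found

if __name__ == "__main__":
    for sample in ["x AND y", "(x)AND(y)", "NOT x", "NOT(x)", "xANDy", "abcNOTdef"]:
        print(repr(sample), find_operators(sample))
```

The last two samples yield no matches, since AND and NOT are glued to letters there.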

EDIT:

Per the comments, here's some context. This is related to another question of mine: "Antlr: how to match everything between the other recognized tokens?" My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered and run a totally different tokenizer on them. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't describe in advance what an ID is. Each human language is different. I want to work in stages: first apply a single query-language tokenizer, and then apply a human-language tokenizer to what's left.
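
To make the staged idea concrete, here is a rough sketch in plain Python rather than ANTLR. The names `tokenize` and `split_whitespace` are hypothetical, and `split_whitespace` is only a stand-in for whatever language-specific tokenizer would really be plugged in. Stage one finds the operators and parentheses; stage two runs the pluggable tokenizer over each stretch of text in between.

```python
import re

# Stage 1 pattern: standalone AND/OR/NOT (not touching a word character)
# or a single parenthesis.
OPERATOR = re.compile(r"(?<!\w)(?:AND|OR|NOT)(?!\w)|[()]")

def split_whitespace(text):
    """Placeholder for a human-language tokenizer (an assumption)."""
    return text.split()

def tokenize(query, word_tokenizer=split_whitespace):
    """Return a flat list of ("OP", text) and ("TEXT", text) tokens."""
    tokens = []
    pos = 0
    for m in OPERATOR.finditer(query):
        # Stage 2: hand the gap before this operator to the word tokenizer.
        tokens.extend(("TEXT", w) for w in word_tokenizer(query[pos:m.start()]))
        tokens.append(("OP", m.group()))
        pos = m.end()
    # Whatever follows the last operator is also plain text.
    tokens.extend(("TEXT", w) for w in word_tokenizer(query[pos:]))
    return tokens

if __name__ == "__main__":
    print(tokenize("NOT(foo bar) AND xANDy"))
```

Passing a different `word_tokenizer` per human language keeps the query-language stage unchanged.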

How should xANDy and abcNOTdef then be tokenised? These are usually tokenised as some sort of identifier token, in which case you shouldn't have an issue. Some more context about what you're trying to parse/tokenise would really help. - Bart Kiers
Bart's right. You are seeing a problem where none exists. Create a rule for AND and one for ID, where ID matches your identifiers. Place the keyword rule (AND) before the ID rule in your grammar. AND will then match when it stands alone (e.g. surrounded by whitespace or other non-identifier characters). Otherwise ID matches and gives you the whole identifier, even if it contains the letters "and". - Mike Lischke
Context added to my question. - ccleve

1 Answer

ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.

What you should consider instead is NLP, which takes a different approach to processing natural language. It involves more than skipping whatever lies between two known tokens.