
I have a lexer with a token that is enabled or disabled by a boolean property.

I create an input stream and tokenize some text. My token is called PHRASE_TEXT and should match anything that fits this pattern: '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?

I tokenize "foo bar" and, as expected, I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text, I get two tokens (foo and bar) instead of one. This is also expected behavior.

The problem comes when I set the property back to true. I would expect the same text to tokenize to the single token "foo bar" again, but instead it is still split into the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried both creating new instances of the lexer and reusing the same instance, but neither works. Thanks in advance.

Edit: part of my grammar follows below.

grammar LuceneQueryParser;

@header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}

@lexer::members {
    public boolean phrases = true;
}

@parser::members {
    public boolean phraseQueries = true;
}

mainQ : LPAREN query RPAREN
      | query
      ;

query : not ((AND|OR)? not)* ;

andClause : AND ;
orClause  : OR ;

not : NOT? modifier? clause;

clause : qualified                        
       | unqualified                          
       ;

unqualified : LBRACK range_in RBRACK
            | LCURL range_out RCURL
            | truncated
            | {phraseQueries}? quoted
            | LPAREN query RPAREN
            | normal
            ;

truncated : TERM_TEXT_TRUNCATED;
range_in  : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);

qualified : TERM_TEXT COLON unqualified ;

normal : TERM_TEXT;
quoted : PHRASE_TEXT;

modifier : PLUS
         | MINUS
         ;

PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR  : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
           | '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
           | '+' | '-' | '!' | ':' | '~' | '^'
           | '*' | '|' | '&' | '?' );


ESCAPE : '\\' ~[];

The problem seems to be that after I set phrases to false and then back to true, no more tokens are recognized as PHRASE_TEXT. I know that, as a guideline, I should define my grammars to be unambiguous, but this is basically how it has to end up working: tokenizing a string with quotes in 2 different modes, depending on the situation.

I would need to see more of the grammar and the calling code in order to answer this question. - Sam Harwell
You might want to look into ANTLR4's support for lexical modes, and try to trigger that switching mechanism from your code. I believe the feature was intended to support situations such as embedding PHP inside HTML. - Darien

1 Answer


I'm going to update this with an answer that a colleague of mine helpfully pointed out. The generated lexer class has a static DFA[] array that is shared between all instances of the class. Once the property was set to false instead of the default true, the cached decisions were apparently changed for every instance, which would explain why creating a new lexer did not help. The fix was to keep two separate DFA[] arrays, one for the true value and one for the false value of the property I was modifying. I think making that array non-static would be too expensive, and I really can't think of another fix.
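To make the effect easier to picture, here is a minimal sketch in plain Java. It is a toy model and not ANTLR's actual simulator: the class names, the scan helper and the "dead end" table are all hypothetical, and the only assumption carried over from the above is that a decision taken while the flag was false ends up in a static table that every instance reads. The second class shows the idea behind the fix, one static table per flag value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SharedCacheDemo {

    /** Toy lexer: the dead-end table is static, so every instance shares it. */
    static final class SharedTableLexer {
        private static final Set<Character> DEAD_ENDS = new HashSet<>();
        boolean phrases = true;

        List<String> tokenize(String input) {
            return scan(input, phrases, DEAD_ENDS);
        }
    }

    /** Sketch of the fix: still static, but one table per flag value. */
    static final class PerFlagTableLexer {
        private static final List<Set<Character>> DEAD_ENDS =
                List.of(new HashSet<>(), new HashSet<>());
        boolean phrases = true;

        List<String> tokenize(String input) {
            return scan(input, phrases, DEAD_ENDS.get(phrases ? 1 : 0));
        }
    }

    /** Hypothetical stand-in for the real matching logic. */
    static List<String> scan(String input, boolean phrases, Set<Character> deadEnds) {
        boolean quoted = input.length() >= 2
                && input.startsWith("\"") && input.endsWith("\"");
        if (quoted && !deadEnds.contains('"')) {
            if (phrases) {
                return List.of(input);      // one PHRASE_TEXT-like token
            }
            deadEnds.add('"');              // predicate failed: cache the dead end
        }
        // Fallback: whitespace-separated TERM_TEXT-like tokens.
        return Arrays.asList(input.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "\"foo bar\"";

        SharedTableLexer shared = new SharedTableLexer();
        System.out.println(shared.tokenize(text).size());   // 1
        shared.phrases = false;
        System.out.println(shared.tokenize(text).size());   // 2
        shared.phrases = true;
        System.out.println(shared.tokenize(text).size());   // 2, stale cached decision
        System.out.println(new SharedTableLexer().tokenize(text).size()); // 2, new instance too

        PerFlagTableLexer fixed = new PerFlagTableLexer();
        fixed.phrases = false;
        System.out.println(fixed.tokenize(text).size());    // 2
        fixed.phrases = true;
        System.out.println(fixed.tokenize(text).size());    // 1
    }
}
```

Keeping the tables static but keyed by the flag value preserves the sharing between instances, which is presumably why it is cheaper than making the array non-static.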